
Python Scrapy: removing some elements from an XPath selector


I am using Scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and sub-elements of a certain div, except for a few divs in the middle. Here is the relevant snippet:

<div align="center" class="article"><!--wanted-->
    <img src="http://i.imgur.com/12345.jpg" width="500" alt="abcde" title="abcde"><br><br>     
    <div style="text-align:justify"><!--wanted-->
        Sample Text<br><br>Demo: <a href="http://www.example.com/?http://example.com/item/asash/asdas-asfasf-afaf.html" target="_blank">http://example.com/dfa/asfa/aasfa</a><br><br>
        <div class="quote"><!--wanted-->
            http://www.coolfiles.ro/download/kleo13.rar/1098750<br>http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links<br>
        </div>
        <br>
        <div align="left"><!--not wanted-->
            <div id="ratig-layer-2249"><!--not wanted-->
                <div class="rating"><!--not wanted-->
                    <ul class="unit-rating">
                        <li class="current-rating" style="width:80%;">80</li>
                        <li><a href="#" title="Bad" class="r1-unit" onclick="doRate('1', '2249'); return false;">1</a></li>
                        <li><a href="#" title="Poor" class="r2-unit" onclick="doRate('2', '2249'); return false;">2</a></li>
                        <li><a href="#" title="Fair" class="r3-unit" onclick="doRate('3', '2249'); return false;">3</a></li>
                        <li><a href="#" title="Good" class="r4-unit" onclick="doRate('4', '2249'); return false;">4</a></li>
                        <li><a href="#" title="Excellent" class="r5-unit" onclick="doRate('5', '2249'); return false;">5</a></li>
                    </ul>
                </div>
                (votes: <span id="vote-num-id-2249">3</span>)
            </div>
        </div>
        <div class="reln"><!--not wanted-->
            <strong>
                <h4>Related News:</h4>
            </strong>
            <li><a href="http://www.example.com/themes/tf/a-b-c-d.html">1</a></li>
            <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d">2</a></li>
            <li><a href="http://www.example.com/themes/tf/a-b-c-d.html">3</a></li>
            <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d.html">4</a></li>
            <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d.html">5</a></li>
        </div>
    </div>
</div>
Here are the XPath expressions I have tried, without getting the expected result:

item['article_html'] = hxs.select("//div[@class='article']").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/following::node() [not(preceding::div[@class='reln']) and not(@class='reln')]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/div[@class='reln']/preceding-sibling::node()[preceding-sibling::div[@class='quote']]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/following::node() [not(preceding::div[@class='reln'])]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/div[@class='quote']/*[not(self::div[@class='reln'])]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/*[(self::name()='reln'])]").extract()[0]

Thanks in advance…

It seems you cannot do this with Scrapy's selectors alone. I have my own function, removeNode() (shown at the bottom of the page), to remove a specific node and its children.

Hope it helps.

XPath doesn't work that way. You can either use an XSLT template, or simply select the paths you want inside div.article > div, join them, and wrap the whole string in a wrapper div.

I think joining and wrapping the whole string would work. It would be great if you could edit your answer with Scrapy code combining my markup above with your idea. I can't do what you suggest because I'm new to this. Thanks. There is a problem with this in my scenario, but I don't know how to implement the fix.

Have you tried it? What errors did you get? We don't want to do your work for you; we want to help you with the interesting parts. Your question seems to be answered in the link. I suggest you try to implement it yourself and, if you fail, post what you tried.

I have added the XPath expressions I tried, but I couldn't get the expected result. It's not that I don't want to learn; I couldn't understand the solution mentioned in the comment because it wasn't indented properly. Thanks for the comment.

The core of the function works. But nodeToRemove is a confusing name: is it a list? If not, there is no way to iterate over it; and if it is a selector, how does it work inside .css()? Using the loop variable instead of indexing with [0] would also be cleaner. It also seems that removing a node drops necessary text along with it, and there is no strip_elements(with_tail=False) (an lxml parser feature) to work around that.
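The removal idea discussed in the comments can be sketched with lxml directly. This is a minimal sketch, not the answerer's actual code: the class names are taken from the question's HTML, and drop_tree() is an lxml.html method that detaches an element while joining its tail text to the previous node, which addresses the lost-text concern raised above.

```python
from lxml import html

# Simplified version of the question's markup.
doc = html.fromstring("""
<div class="article">
  <img src="http://i.imgur.com/12345.jpg">
  <div style="text-align:justify">
    Sample Text
    <div class="quote">download links</div>
    <div align="left">rating widget</div>
    <div class="reln"><h4>Related News:</h4></div>
  </div>
</div>
""")

# Detach the unwanted blocks; drop_tree() keeps each node's tail text.
for unwanted in doc.xpath("//div[@class='reln'] | //div[@align='left']"):
    unwanted.drop_tree()

article_html = html.tostring(doc, encoding="unicode")
```

Since Scrapy selectors are backed by lxml trees, the same approach can be applied to a selector's .root element before extracting.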
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem


class IsBullshitSpider(CrawlSpider):
    """ General configuration of the Crawl Spider """
    name = 'isbullshitwp'
    start_urls = ['http://example.com/themes'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True), 
        # r'page/\d+' : regular expression for http://example.com/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]
        # r'\d{4}/\d{2}/\w+' : regular expression for http://example.com/YYYY/MM/title URLs

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//span[@class="storytitle"]/text()').extract()[0]
        item['article_html'] = hxs.select("//div[@class='article']").extract()[0]

        return item
def removeNode(context, nodeToRemove):
    # nodeToRemove is a list of CSS selector strings; the first match of
    # each selector is detached from the underlying lxml tree, together
    # with its children.
    for element in nodeToRemove:
        contentToRemove = context.css(element)

        if contentToRemove:
            contentToRemove = contentToRemove[0].root
            contentToRemove.getparent().remove(contentToRemove)

    return context.extract()
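One caveat from the comment thread: node.getparent().remove(node) also discards the node's tail text (the text that follows its closing tag). A minimal sketch, assuming plain lxml and the question's class names, that re-attaches the tail before removing:

```python
from lxml import etree

doc = etree.fromstring(
    '<div class="article">intro '
    '<div class="reln">Related News</div>'
    ' closing text</div>'
)

for node in doc.xpath("//div[@class='reln']"):
    parent = node.getparent()
    if node.tail:
        prev = node.getprevious()
        if prev is not None:
            # Append the tail to the previous sibling's tail.
            prev.tail = (prev.tail or "") + node.tail
        else:
            # No previous sibling: the tail belongs after the parent's text.
            parent.text = (parent.text or "") + node.tail
    parent.remove(node)

cleaned = etree.tostring(doc, encoding="unicode")
```

The same tail handling could be added inside removeNode() above to keep the text that follows each removed div.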