Scrapy: restrict_xpaths argument does not filter crawled data

python xpath scrapy web-crawler

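I'm using Scrapy 1.0.5 and trying to crawl a series of articles to get their titles and corresponding URLs. I only want to crawl the links inside the div element with the ID devBody. With that in mind, I tried to specify this restriction in the rule, but I don't understand why it still crawls links outside that scope: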

from scrapy import Spider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["dev.mysql.com"]
    start_urls = ["http://dev.mysql.com/tech-resources/articles/"]

    rules = (Rule(LinkExtractor(restrict_xpaths='//div[@id="devBody"]',), callback='parse'),)

    def parse(self, response):
        entries = response.xpath('//h4')
        items = []
        # using a counter here feels lame but I really couldn't think of a better
        # way to avoid getting a list of all URLs and titles wrapped into a single object
        i = 0
        for entry in entries:
            item = StackItem()
            item['title'] = entry.xpath('//a/text()').extract()[i]
            item['url'] = entry.xpath('//a/@href').extract()[i]
            yield item
            items.append(item)
            i += 1

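While trying to understand this behavior, I used Chrome DevTools to query the elements with XPath, and everything works as expected there. However, when I (try to) put the same sequence of steps into code, things behave differently. It picks up data from outside the div that holds the URLs of the articles. It does report fetching the 57 items I wanted, but something goes wrong along the way.

I don't know what I'm doing wrong. Any help would be greatly appreciated.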

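You need to base your StackSpider class on the CrawlSpider class, which is the class that has the rules attribute; a plain Spider ignores it. You also need to rename your parse() method and change the callback accordingly, because CrawlSpider uses parse() itself, as described in the documentation.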

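Or, plan B

A CrawlSpider doesn't add much for scraping this single page. It is simple enough to use a plain Spider and loop over the "h4/a" combinations to get the information you need. Try this: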

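def parse(self, response):
    # loop only over the h4 headings inside the devBody div
    for row in response.xpath('//div[@id="devBody"]/h4'):
        item = StackItem()  # a fresh item for every heading
        item['title'] = row.xpath('a/text()').extract_first()
        # get the full url
        item['url'] = response.urljoin(row.xpath('a/@href').extract_first())
        yield item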

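Thanks for the heads-up. I had read this and several other articles, but it didn't even occur to me to check these details. I made the adjustments you mentioned, but now it seems to filter out the items I want to get back. This is my current code and behavior.

@w00t I've edited my answer with another suggestion, which should get you what you want.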