What is the purpose of setting follow to True in the LinkExtractor of a Scrapy CrawlSpider?


python web-scraping scrapy


I saw this example CrawlSpider code in the documentation:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.logger.info('Hi, this is an item page! %s', response.url)
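        # A plain dict is used here: a bare scrapy.Item() declares no fields,
        # so assigning item['id'] on it would raise KeyError.
        item = {}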
        item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
        item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
        return item

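From my understanding, these steps would happen:

The Scrapy spider above (MySpider) will get a response from the Scrapy engine for the 'http://www.example.com' link (which is in the start_urls list). Then the LinkExtractors will extract all the links from that response, based on the two rules provided above.

Now let's assume that the second LinkExtractor (the one with the callback) got 3 links ('http://www.example.com/item1.php', 'http://www.example.com/item2.php', 'http://www.example.com/item3.php') and the first LinkExtractor, which has no callback, got 1 link (www.example.com/category1.php).

For the 3 links above, the specified callback, parse_item, would simply be called. But what happens to that 1 link (www.example.com/category1.php), since there is no callback associated with it? Would the two LinkExtractors be run again on that 1 link? Is this assumption correct?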

# Extract links matching 'category.php' (but not matching 'subsection.php')
# and follow links from them (since no callback means follow=True by default).

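Since the Rule object has no callback argument, its follow argument is set to True.

So in your example, the 1 link will be crawled and links will be extracted from it, just as was done for the first page. This continues until the first rule extracts no more links or all of the links have already been visited.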


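Oh OK, I understand now. So essentially the two LinkExtractors will again extract the matching links from the response generated by that 1 link? Is there any need to supply a callback when setting follow=True?

No, there is no need to provide a callback in order to follow links, since you don't want to parse them manually. Think of it this way: follow=True means the response goes to a hidden callback that only runs all of the rules on the response and does nothing else.

You stated in your example "so the 1 link will be crawled and links will be extracted from it". When you say a link will be crawled, you basically mean that links will be extracted from it based on the LinkExtractors, right?

@SigorEzz - yup!

OK, thanks! But technically I may need to parse some items out of the response for that 1 link. For example, after getting the response for that 1 link, I might parse out some items in its callback and then, as you stated, have my two LinkExtractors run against the response for that 1 link. I don't really understand what you mean by "you don't want to parse them manually". Sorry for asking so many questions and taking up so much of your time.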
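On the last point in the comments: callback and follow are independent settings on a Rule, so a page can be both parsed and followed. In a spider that would look like Rule(LinkExtractor(allow=r'category\.php'), callback='parse_category', follow=True), where parse_category is a hypothetical method name. The small sketch below only mirrors the defaulting rule discussed in this thread (follow is derived from callback when it is not given); resolve_follow is an illustrative helper, not a Scrapy function.

```python
def resolve_follow(callback=None, follow=None):
    """Illustrative helper: an explicit follow wins, otherwise follow only when there is no callback."""
    if follow is not None:
        return follow
    return not callback


# The four combinations of callback and follow on a rule.
combos = {
    "no callback, follow unset": resolve_follow(),
    "callback, follow unset": resolve_follow(callback="parse_item"),
    "callback, follow=True": resolve_follow(callback="parse_category", follow=True),
    "no callback, follow=False": resolve_follow(follow=False),
}

for description, follow in combos.items():
    print(f"{description}: follow={follow}")
```

The third combination is the one the asker needs: the callback parses items from the category page, and because follow is True the rules are still applied to the links on that page.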