Scrapy CrawlSpider - errback for the start URLs


I'm using a CrawlSpider with a rule LinkExtractor that has an errback.

I'm using parse_start_url to parse the start_urls, but I also need the errback to fire for them:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CS(CrawlSpider):
    name = "CS"
    rules = (Rule(LinkExtractor(allow=[], deny=[]), follow=True,
                  callback='my_parse', errback='my_errback'),)
    custom_settings = {
        'DEPTH_LIMIT': 3,
        # etc.
    }

    start_urls = ['url']
    allowed_domains = ['domain']

    def my_errback(self, failure):
        # log all failures
        self.logger.error(repr(failure))

    def parse_start_url(self, response):
        return self.my_parse(response)

    def my_parse(self, response):
        # parse responses
        pass

The problem I'm facing is that the errback is only called for the extracted links, not for the start URLs.

I can't use a start_requests method (as shown below) because I'm using a CrawlSpider with rules; when I do that, only the start URLs get scraped:

def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.my_parse,
                             errback=self.my_errback)

The answer is:

Basically, we do have to use start_requests, but without the callback argument. That way the default self.parse is called, so parse_start_url handles the start URLs and the rules are still followed:

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CS(CrawlSpider):
    name = "CS"
    rules = (Rule(LinkExtractor(allow=[], deny=[]), follow=True,
                  callback='my_parse', errback='my_errback'),)
    custom_settings = {
        'DEPTH_LIMIT': 3,
        # etc.
    }

    start_urls = ['url']
    allowed_domains = ['domain']

    def start_requests(self):
        # no callback: CrawlSpider's default parse handles the response,
        # so parse_start_url runs for the start URLs and the rules still apply
        for u in self.start_urls:
            yield scrapy.Request(u, errback=self.my_errback)

    def my_errback(self, failure):
        # log all failures
        self.logger.error(repr(failure))

    def parse_start_url(self, response):
        return self.my_parse(response)

    def my_parse(self, response):
        # parse responses
        pass
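
For reference, a minimal sketch of what my_errback could look like, modelled on the errback example in the Scrapy documentation; the specific exception types checked here (HttpError, DNSLookupError, timeouts) are only assumptions about which failures you want to distinguish in the logs:

# at the top of the spider module:
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TCPTimedOutError, TimeoutError

    # inside the CS spider class:
    def my_errback(self, failure):
        # log every failure, including those raised for the start URLs
        self.logger.error(repr(failure))

        if failure.check(HttpError):
            # a response was received, but with a non-2xx status
            response = failure.value.response
            self.logger.error('HttpError on %s', response.url)
        elif failure.check(DNSLookupError):
            # the domain could not be resolved
            request = failure.request
            self.logger.error('DNSLookupError on %s', request.url)
        elif failure.check(TimeoutError, TCPTimedOutError):
            # the request timed out
            request = failure.request
            self.logger.error('TimeoutError on %s', request.url)

Since the same errback is set both on the Rule and on the requests yielded from start_requests, this one method now receives failures for the start URLs as well as for the links extracted by the rule.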