Scrapy CrawlSpider: errback for start URLs
I am using a CrawlSpider with a Rule whose LinkExtractor has an errback. I use parse_start_url to parse the start URLs, but I also need the errback to cover them:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CS(CrawlSpider):
    name = "CS"
    # errback on a Rule requires Scrapy 2.0+
    rules = (Rule(LinkExtractor(allow=[], deny=[]), follow=True,
                  callback='my_parse', errback='my_errback'),)
    custom_settings = {
        'DEPTH_LIMIT': 3,
        # etc.
    }
    start_urls = ['url']
    allowed_domains = ['domain']

    def my_errback(self, failure):
        # log all failures
        self.logger.error(repr(failure))

    def parse_start_url(self, response):
        return self.my_parse(response)

    def my_parse(self, response):
        # parse responses
        pass
The problem I am facing is that the errback is only called for the extracted links, never for the start URLs.
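This happens because CrawlSpider inherits the default start_requests from scrapy.Spider, which attaches no errback to the initial requests. A simplified sketch of that default (not the exact source, which varies by version):

# Simplified sketch of the inherited scrapy.Spider.start_requests:
# the initial requests carry no errback, so failures on the start URLs
# never reach the Rule's errback.
def start_requests(self):
    for url in self.start_urls:
        yield scrapy.Request(url, dont_filter=True)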
I cannot use the start_requests method shown below with a callback, because I am using a CrawlSpider with rules; when I do this, only the start URLs are scraped:
def start_requests(self):
    for u in self.start_urls:
        yield scrapy.Request(u, callback=self.my_parse,
                             errback=self.my_errback)
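That behavior follows from how CrawlSpider works: rule processing is wired into its default parse entry point, so passing an explicit callback skips link extraction entirely. In simplified, version-dependent form (not the exact source):

# Simplified sketch of CrawlSpider's default callback: giving the
# Request its own callback bypasses _parse_response, so no links are
# extracted or followed.
def parse(self, response):
    return self._parse_response(response, self.parse_start_url,
                                cb_kwargs={}, follow=True)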
Answer:
Basically, we have to use start_requests, but drop the callback argument. That way the default self.parse is invoked and the rules are still honored:
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class CS(CrawlSpider):
    name = "CS"
    rules = (Rule(LinkExtractor(allow=[], deny=[]), follow=True,
                  callback='my_parse', errback='my_errback'),)
    custom_settings = {
        'DEPTH_LIMIT': 3,
        # etc.
    }
    start_urls = ['url']
    allowed_domains = ['domain']

    def start_requests(self):
        # no callback: the default self.parse runs, so the rules still
        # apply, and the errback now also covers the start URLs
        for u in self.start_urls:
            yield scrapy.Request(u, errback=self.my_errback)

    def my_errback(self, failure):
        # log all failures
        self.logger.error(repr(failure))

    def parse_start_url(self, response):
        return self.my_parse(response)

    def my_parse(self, response):
        # parse responses
        pass
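As a usage note, the errback receives a Twisted Failure, for start URLs and extracted links alike. A common pattern, adapted from the errback example in the Scrapy docs, is to branch on the failure type:

from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError, TimeoutError

def my_errback(self, failure):
    # failure.request is the Request that failed
    if failure.check(HttpError):
        # a response was received, but with a non-2xx status
        self.logger.error('HttpError on %s', failure.value.response.url)
    elif failure.check(DNSLookupError):
        self.logger.error('DNSLookupError on %s', failure.request.url)
    elif failure.check(TimeoutError):
        self.logger.error('TimeoutError on %s', failure.request.url)
    else:
        self.logger.error(repr(failure))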