I wrote an errback for my Scrapy spider, but tracebacks still keep happening. Why?
I am using Scrapy 1.1 and I invoke Scrapy from within a script. My spider launch method looks like this:
    def run_spider(self):
        runner = CrawlerProcess(get_project_settings())
        spider = SiteSpider()
        configure_logging()
        d = runner.crawl(spider, websites_file=self.raw_data_file)
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
Here is an excerpt from my spider with an errback, written as shown in the documentation, but it only prints when a failure is caught:
    import scrapy
    from scrapy.spidermiddlewares.httperror import HttpError
    from twisted.internet.error import DNSLookupError
    from twisted.internet.error import TimeoutError, TCPTimedOutError

    class SiteSpider(scrapy.Spider):
        name = 'SiteCrawler'

        custom_settings = {
            'FEED_FORMAT': 'json',
            'FEED_URI': 'result.json',
        }

        def __init__(self, websites_file=None, *args, **kwargs):
            super().__init__(*args, **kwargs)
            self.websites_file = websites_file
            print('***********')
            print(self.websites_file)

        def start_requests(self):
            .....
            if is_valid_url(website_url):
                yield scrapy.Request(url=website_url, callback=self.parse,
                                     errback=self.handle_errors,
                                     meta={'url': account_id})

        def parse(self, response):
            .....
            yield item

        def handle_errors(self, failure):
            if failure.check(HttpError):
                # these exceptions come from the HttpError spider middleware
                # you can get the non-200 response
                response = failure.value.response
                print('HttpError on ' + response.url)
            elif failure.check(DNSLookupError):
                # this is the original request
                request = failure.request
                print('DNSLookupError on ' + request.url)
            elif failure.check(TimeoutError, TCPTimedOutError):
                request = failure.request
                print('TimeoutError on ' + request.url)
My problem is that I get the errors I expect, such as:
TimeoutError on http://www.example.com
but I also get tracebacks for the same websites:
2016-08-05 13:40:55 [scrapy] ERROR: Error downloading <GET http://www.example.com/robots.txt>: TCP connection timed out: 60: Operation timed out.
Traceback (most recent call last):
File ".../anaconda/lib/python3.5/site-packages/twisted/internet/defer.py", line 1126, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File ".../anaconda/lib/python3.5/site-packages/twisted/python/failure.py", line 389, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File ".../anaconda/lib/python3.5/site-packages/scrapy/core/downloader/middleware.py", line 43, in process_request
defer.returnValue((yield download_func(request=request,spider=spider)))
twisted.internet.error.TCPTimedOutError: TCP connection timed out: 60: Operation timed out.
On top of that, I still see DEBUG messages, not just WARNING and above (this is with configure_logging() added to the spider launch). I am running this from a terminal on Mac OS X.
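If the DEBUG noise is the immediate concern, one option (a sketch using only the standard logging module; 'scrapy' is the top-level logger name Scrapy's loggers live under) is to raise the threshold of Scrapy's logger after calling configure_logging():

```python
import logging

# raise the threshold of Scrapy's logger so DEBUG/INFO records are dropped;
# do this after configure_logging() so it is not overridden
logging.getLogger('scrapy').setLevel(logging.WARNING)

# quick check: DEBUG is now disabled for that logger
print(logging.getLogger('scrapy').isEnabledFor(logging.DEBUG))
```

This only filters what reaches the handlers; it does not change what the downloader does, so errors such as the retry/timeout ones above are still logged at ERROR level.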
I would be glad for any help.

Try the following in your script:
    if __name__ == '__main__':
        runner = CrawlerProcess(get_project_settings())
        spider = SiteSpider()
        configure_logging()
        # note: there is no `self` at module level, so pass the path to
        # your websites file directly here instead of self.raw_data_file
        d = runner.crawl(spider, websites_file=raw_data_file)
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
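Alternatively, the log level can be capped from the spider itself via the LOG_LEVEL setting (a sketch; LOG_LEVEL is a standard Scrapy setting, and the rest of the dict mirrors the custom_settings already in the question):

```python
# extend the spider's existing custom_settings;
# LOG_LEVEL caps which records Scrapy emits at all
custom_settings = {
    'FEED_FORMAT': 'json',
    'FEED_URI': 'result.json',
    'LOG_LEVEL': 'WARNING',  # drop DEBUG/INFO records
}

print(custom_settings['LOG_LEVEL'])
```

Since custom_settings is applied per spider, this keeps other spiders in the project unaffected.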