Python Scrapy Cloud spider requests fail with GeneratorExit
I have a multi-level Scrapy spider that works locally, but in Scrapy Cloud it returns GeneratorExit on every request. Here are the parse methods:
def parse(self, response):
    results = list(response.css(".list-group li a::attr(href)"))
    for c in results:
        meta = {}
        for key in response.meta.keys():
            meta[key] = response.meta[key]
        yield response.follow(c,
                              callback=self.parse_category,
                              meta=meta,
                              errback=self.errback_httpbin)

def parse_category(self, response):
    category_results = list(response.css(
        ".item a.link-unstyled::attr(href)"))
    category = response.css(".active [itemprop='title']")
    for r in category_results:
        meta = {}
        for key in response.meta.keys():
            meta[key] = response.meta[key]
        meta["category"] = category
        yield response.follow(r, callback=self.parse_item,
                              meta=meta,
                              errback=self.errback_httpbin)

def errback_httpbin(self, failure):
    # log all failures
    self.logger.error(repr(failure))
Here is the traceback:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 102, in iter_errback
yield next(it)
GeneratorExit
[stderr] Exception ignored in: <generator object iter_errback at 0x7fdea937a9e8>
File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 1243, in run
self.mainLoop()
File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 1252, in mainLoop
self.runUntilCurrent()
File "/usr/local/lib/python3.6/site-packages/twisted/internet/base.py", line 878, in runUntilCurrent
call.func(*call.args, **call.kw)
File "/usr/local/lib/python3.6/site-packages/twisted/internet/task.py", line 671, in _tick
taskObj._oneWorkUnit()
--- <exception caught here> ---
File "/usr/local/lib/python3.6/site-packages/twisted/internet/task.py", line 517, in _oneWorkUnit
result = next(self._iterator)
File "/usr/local/lib/python3.6/site-packages/scrapy/utils/defer.py", line 63, in <genexpr>
work = (callable(elem, *args, **named) for elem in iterable)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/scraper.py", line 183, in _process_spidermw_output
self.crawler.engine.crawl(request=output, spider=spider)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 210, in crawl
self.schedule(request, spider)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/engine.py", line 216, in schedule
if not self.slot.scheduler.enqueue_request(request):
File "/usr/local/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 57, in enqueue_request
dqok = self._dqpush(request)
File "/usr/local/lib/python3.6/site-packages/scrapy/core/scheduler.py", line 86, in _dqpush
self.dqs.push(reqd, -request.priority)
File "/usr/local/lib/python3.6/site-packages/queuelib/pqueue.py", line 35, in push
q.push(obj) # this may fail (eg. serialization error)
File "/usr/local/lib/python3.6/site-packages/scrapy/squeues.py", line 15, in push
s = serialize(obj)
File "/usr/local/lib/python3.6/site-packages/scrapy/squeues.py", line 27, in _pickle_serialize
return pickle.dumps(obj, protocol=2)
builtins.TypeError: can't pickle HtmlElement objects
I set up an errback, but it does not give any error details. I also copy meta into every request, but that makes no difference. Am I missing something?
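Update:
This error seems to be inherent to multi-level spiders. For now, I have rewritten this one using only a single parse method.

One cause is that a setting is enabled which makes Scrapy serialize requests into disk queues instead of memory queues.
When serializing to disk, the pickle operation fails because your request.meta dict contains a SelectorList object (assigned in the line category = response.css(".active [itemprop='title']")), and selectors hold instances of lxml.html.HtmlElement objects. Those objects cannot be pickled (a limitation that is outside Scrapy's scope), hence the TypeError: can't pickle HtmlElement objects.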
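To make the failure described above easier to spot before a request is scheduled, here is a minimal sketch that checks which values in a meta dict cannot be pickled. The helper name find_unpicklable_keys is hypothetical and not part of Scrapy, and threading.Lock() is only a standard-library stand-in for an unpicklable value such as a SelectorList.

```python
import pickle
import threading


def find_unpicklable_keys(meta, protocol=2):
    """Return the keys of `meta` whose values cannot be pickled.

    Hypothetical helper for illustration; protocol=2 mirrors the
    pickle.dumps call shown in the traceback.
    """
    bad_keys = []
    for key, value in meta.items():
        try:
            pickle.dumps(value, protocol=protocol)
        except (pickle.PicklingError, TypeError, AttributeError):
            bad_keys.append(key)
    return bad_keys


# threading.Lock() stands in for an unpicklable object such as a SelectorList.
meta = {"category": "Books", "node": threading.Lock()}
print(find_unpicklable_keys(meta))
```

In the spider itself, one way to keep meta picklable would probably be to store the extracted text rather than the SelectorList, for example by selecting ".active [itemprop='title']::text" and calling extract_first() on the result, so that only a plain string ends up in meta["category"].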
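There is a pull request that addresses this issue. It does not fix the pickle operation itself. What it does is instruct the scheduler not to attempt to serialize such requests to disk, so they go to memory instead.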
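The fallback behaviour described above can be illustrated with a toy queue. This is only a sketch of the idea, not Scrapy's scheduler code: the class FallbackQueue and its attributes are hypothetical, and plain lists stand in for the disk and memory queues.

```python
import pickle
import threading


class FallbackQueue:
    """Toy queue: picklable items go to a 'disk' list, the rest stay in memory."""

    def __init__(self):
        self.disk = []    # pickled bytes, standing in for an on-disk queue
        self.memory = []  # live objects that could not be serialized

    def push(self, obj):
        try:
            self.disk.append(pickle.dumps(obj, protocol=2))
            return "disk"
        except (pickle.PicklingError, TypeError, AttributeError):
            # Serialization failed: keep the object in memory instead of raising.
            self.memory.append(obj)
            return "memory"


queue = FallbackQueue()
# A request-like dict with only plain values can be serialized.
print(queue.push({"url": "https://example.com", "category": "Books"}))
# threading.Lock() stands in for an unpicklable value such as a SelectorList.
print(queue.push({"url": "https://example.com", "category": threading.Lock()}))
```

With this design the unpicklable request is still processed, but it lives only in memory, so it would not survive a restart the way a request in a disk queue would.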

I can see from the PR that it will land in v1.6, but it has not been released to stack 1.5.1-py3 yet. Will it also solve the failure to pickle ItemLoader() objects that have the response passed into their constructor? I get builtins.TypeError: can't pickle Selector objects, because I am passing an ItemLoader object in the request meta. Thanks.