How does Python Scrapy process a Request's callback result?
Can anyone explain how Scrapy calls and processes the result of a Request's callback function? I know the result can be either a single object (Request, BaseItem, or None) or an iterable of such objects. For example, a callback can:
1. return a single object (a Request, a BaseItem, or None), or
2. return an iterable of objects, as in:
def parse(self, response):
    ...
    for url in self.urls:
        yield scrapy.Request(...)
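For case 1, a callback could be as simple as the sketch below (my own illustration; the next_url attribute and parse_detail callback are made up):

def parse(self, response):
    # Case 1: return a single object (here a Request); returning an item
    # or None would also be accepted by Scrapy.
    return scrapy.Request(self.next_url, callback=self.parse_detail)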
I assumed they are handled somewhere in the Scrapy code roughly like this:
# process_callback_result is assumed to be a function that is called after
# a Request's callback has been executed.
# The "result" parameter is the callback's return value.
def process_callback_result(self, result):
    if isinstance(result, scrapy.Request):
        self.process_request(result)
    elif isinstance(result, scrapy.BaseItem):
        self.process_item(result)
    elif result is None:
        pass
    elif isinstance(result, collections.Iterable):
        for obj in result:
            self.process_callback_result(obj)
    else:
        # show an error message
        # ...
        pass
I found the corresponding code in the _process_spidermw_output function in /Lib/site-packages/scrapy/core/scraper.py:
def _process_spidermw_output(self, output, request, response, spider):
    """Process each Request/Item (given in the output parameter) returned
    from the given spider
    """
    if isinstance(output, Request):
        self.crawler.engine.crawl(request=output, spider=spider)
    elif isinstance(output, BaseItem):
        self.slot.itemproc_size += 1
        dfd = self.itemproc.process_item(output, spider)
        dfd.addBoth(self._itemproc_finished, output, response, spider)
        return dfd
    elif output is None:
        pass
    else:
        typename = type(output).__name__
        log.msg(format='Spider must return Request, BaseItem or None, '
                       'got %(typename)r in %(request)s',
                level=log.ERROR, spider=spider, request=request, typename=typename)
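For example, a callback that returns something other than a Request, a BaseItem, an iterable, or None ends up in that final else branch and produces the ERROR log. A sketch of a spider that would trigger it (the spider itself is made up for illustration):

import scrapy

class BadReturnSpider(scrapy.Spider):
    name = 'bad_return'
    start_urls = ['http://example.com']

    def parse(self, response):
        # An int is not a Request, not a BaseItem, not None, and not iterable,
        # so _process_spidermw_output logs something like:
        #   Spider must return Request, BaseItem or None, got 'int' in <GET http://example.com>
        return 42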
However, I could not find the part with the elif isinstance(result, collections.Iterable): logic.

That is because _process_spidermw_output is only the handler for a single item/object. It is called from scrapy.utils.defer.parallel. This is the function that handles the spider output:
def handle_spider_output(self, result, request, response, spider):
    if not result:
        return defer_succeed(None)
    it = iter_errback(result, self.handle_spider_error, request, response, spider)
    dfd = parallel(it, self.concurrent_items,
                   self._process_spidermw_output, request, response, spider)
    return dfd
Source: scrapy/core/scraper.py
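For context, the iter_errback used here (also defined in scrapy/utils/defer.py) just wraps the iteration so that an exception raised while consuming the spider's generator is routed to handle_spider_error. It works roughly like this (a simplified sketch, not the verbatim source):

from twisted.python import failure

def iter_errback(iterable, errback, *a, **kw):
    """Yield items from iterable, calling errback(failure, ...) on error."""
    it = iter(iterable)
    while True:
        try:
            yield next(it)
        except StopIteration:
            break
        except Exception:
            errback(failure.Failure(), *a, **kw)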
As you can see, it calls parallel and passes it a handle to the _process_spidermw_output function as an argument. That argument is named callable, and it is invoked for every element of the iterable containing the spider results. The parallel function is:
def parallel(iterable, count, callable, *args, **named):
    """Execute a callable over the objects in the given iterable, in parallel,
    using no more than ``count`` concurrent calls.

    Taken from: http://jcalderone.livejournal.com/24285.html
    """
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for i in xrange(count)])
Source: scrapy/utils/defer.py
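To see what parallel does in isolation, here is a small standalone sketch driving it with Twisted's reactor (the handle function and the range(10) input are made up for illustration; requires Twisted):

from twisted.internet import defer, reactor, task

def parallel(iterable, count, callable, *args, **named):
    coop = task.Cooperator()
    work = (callable(elem, *args, **named) for elem in iterable)
    return defer.DeferredList([coop.coiterate(work) for i in range(count)])

def handle(item):
    # Stand-in for _process_spidermw_output: process one element of the output.
    print('processing', item)

d = parallel(range(10), 3, handle)          # at most 3 concurrent calls
d.addCallback(lambda _: reactor.stop())
reactor.run()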
Basically, the process goes like this: when enqueue_scrape is called, it adds the request and response to the slot's queue by calling slot.add_response_request. The queue is then processed by _scrape_next, which calls self._scrape. The _scrape function registers handle_spider_output as a callback that will process the items in the iterator. The iterator is created when _scrape2 is called; at some point it calls call_spider, and that function registers a callback to scrapy.utils.spider.iterate_spider_output:
def iterate_spider_output(result):
    return [result] if isinstance(result, BaseItem) else arg_to_iter(result)
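To make that chain concrete, this is roughly how call_spider wires things up in scraper.py (simplified from my reading of the source, not a verbatim copy):

from scrapy.utils.defer import defer_result
from scrapy.utils.spider import iterate_spider_output

def call_spider(self, result, request, spider):
    # result is the downloaded Response (or a Failure)
    dfd = defer_result(result)
    # run the Request's callback (or spider.parse by default) on the response
    dfd.addCallbacks(request.callback or spider.parse, request.errback)
    # normalize whatever the callback returned into an iterable
    return dfd.addCallback(iterate_spider_output)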
Finally, the function that actually converts a single item, None, or an iterator into an iterable is scrapy.utils.misc.arg_to_iter() (shown further below).
Does this mean the value (a Request or BaseItem) returned by a callback function (e.g. def parse) is always converted into an iterable? Where exactly is the code that turns a Request/BaseItem into an iterable? Starting from handle_spider_output I can only trace back to the _scrape function, but I do not really understand the code there.
These two functions together do the conversion. iterate_spider_output handles the single-BaseItem case and delegates everything else to arg_to_iter:

def iterate_spider_output(result):
    return [result] if isinstance(result, BaseItem) else arg_to_iter(result)
def arg_to_iter(arg):
    """Convert an argument to an iterable. The argument can be a None, single
    value, or an iterable.

    Exception: if arg is a dict, [arg] will be returned
    """
    if arg is None:
        return []
    elif not isinstance(arg, _ITERABLE_SINGLE_VALUES) and hasattr(arg, '__iter__'):
        return arg
    else:
        return [arg]
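So yes: whatever the callback returns is passed through iterate_spider_output and arg_to_iter, and a single Request or BaseItem ends up wrapped in a one-element list before handle_spider_output iterates over it. A quick check of arg_to_iter's behaviour (the URLs are just examples; requires a Scrapy installation):

from scrapy import Request
from scrapy.utils.misc import arg_to_iter

print(list(arg_to_iter(None)))                            # []
print(list(arg_to_iter(Request('http://example.com'))))   # one-element list containing the Request
print(list(arg_to_iter([Request('http://example.com/a'),
                        Request('http://example.com/b')])))  # the list is returned unchanged
print(list(arg_to_iter({'title': 'x'})))                  # [{'title': 'x'}] -- dicts count as single items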