Scrapy asynchronous API calls from a pipeline
I'm trying to figure out why each request sent from a pipeline behaves like an independent request and partly ignores AUTOTHROTTLE. The idea is to send the collected items from the pipeline to a REST API via spider.crawler.engine.download.
Settings:
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_TARGET_CONCURRENCY = 50.0
AUTOTHROTTLE_DEBUG = True
Suppose we have a spider that issues requests to the site (simplified code). Finally, we want to send the items from the pipeline to the REST API:
from scrapy.http import JsonRequest

def process_item(self, item, spider):
    item = dict(item)
    # POST the item as JSON, sharing the download slot used for the site
    request = JsonRequest("http://example.com/api/v1/item",
                          errback=self.errback_http,
                          method='POST',
                          meta={'download_slot': "site_slot"},
                          data=item)
    # Hand the request straight to the engine; the returned Deferred is ignored
    spider.crawler.engine.download(request, spider)
    return item
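So far everything looks fine: pages are crawled and the asynchronous requests to the REST API are made (the web server's log also shows the POST requests).
Part of stdout: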
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:10 | delay: 14 ms (-6) | latency: 404 ms | size: 14775 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 9 | delay: 13 ms (+0) | latency: 648 ms | size: 18353 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 8 | delay: 21 ms (+7) | latency: 1060 ms | size: 18030 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 7 | delay: 21 ms (+0) | latency: 1095 ms | size: 19249 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 7 | delay: 16 ms (-5) | latency: 508 ms | size: 16980 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 6 | delay: 25 ms (+9) | latency: 567 ms | size: 21663 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 6 | delay: 17 ms (-8) | latency: 447 ms | size: 14936 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 6 | delay: 13 ms (-3) | latency: 523 ms | size: 17130 bytes
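But it would be better to separate the API calls into another download slot, so I changed the download slot in the pipeline request: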
from scrapy.http import JsonRequest

def process_item(self, item, spider):
    item = dict(item)
    # Same request as before, but routed to a dedicated download slot
    request = JsonRequest("http://example.com/api/v1/item",
                          errback=self.errback_http,
                          method='POST',
                          meta={'download_slot': "api_slot"},
                          data=item)
    # Hand the request straight to the engine; the returned Deferred is ignored
    spider.crawler.engine.download(request, spider)
    return item
Here is my problem: now every API call behaves like an independent task and gets the initial AUTOTHROTTLE settings:
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 2 | delay: 1019 ms (-980) | latency: 1930 ms | size: 17996 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 2 | delay: 539 ms (-479) | latency: 2974 ms | size:367953 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 4 | delay: 293 ms (-245) | latency: 2388 ms | size: 12509 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 4 | delay: 168 ms (-125) | latency: 2145 ms | size: 13688 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 5 | delay: 101 ms (-67) | latency: 1703 ms | size: 13180 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 9 | delay: 67 ms (-33) | latency: 1682 ms | size: 11926 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:11 | delay: 40 ms (-27) | latency: 635 ms | size: 15215 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:15 | delay: 37 ms (-2) | latency: 1769 ms | size: 12859 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:21 | delay: 35 ms (-2) | latency: 1657 ms | size: 12739 bytes
>[scrapy.extensions.throttle] INFO: slot: api_slot | conc: 1 | delay: 2000 ms (+0) | latency: 503 ms | size: 57 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:25 | delay: 24 ms (-10) | latency: 707 ms | size: 15382 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:30 | delay: 37 ms (+12) | latency: 1872 ms | size: 17532 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:30 | delay: 38 ms (+1) | latency: 1929 ms | size: 15600 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:30 | delay: 36 ms (-2) | latency: 1713 ms | size: 15435 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:31 | delay: 35 ms (+0) | latency: 1731 ms | size: 15198 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:33 | delay: 22 ms (-12) | latency: 502 ms | size: 15195 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:39 | delay: 28 ms (+5) | latency: 1402 ms | size: 15367 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:43 | delay: 37 ms (+9) | latency: 1875 ms | size: 15192 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:43 | delay: 23 ms (-13) | latency: 519 ms | size:366847 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:43 | delay: 35 ms (+11) | latency: 1789 ms | size: 15150 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:46 | delay: 35 ms (+0) | latency: 1752 ms | size: 15209 bytes
>[scrapy.extensions.throttle] INFO: slot: api_slot | conc: 1 | delay: 2000 ms (+0) | latency: 526 ms | size: 57 bytes
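To compare the two slots side by side, the debug lines can be grouped by slot name. The following is only a minimal sketch: the regular expression is inferred from the log lines shown above rather than from any documented format, and the helper name `delays_by_slot` is hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the AUTOTHROTTLE_DEBUG lines above (an assumption,
# not an official format specification).
LINE_RE = re.compile(
    r"slot:\s*(?P<slot>\S+)\s*\|\s*conc:\s*(?P<conc>\d+)\s*\|"
    r"\s*delay:\s*(?P<delay>\d+)\s*ms\s*\((?P<diff>[+-]\d+)\)\s*\|"
    r"\s*latency:\s*(?P<latency>\d+)\s*ms"
)


def delays_by_slot(lines):
    """Return {slot name: [delay in ms, ...]} in log order."""
    delays = defaultdict(list)
    for line in lines:
        match = LINE_RE.search(line)
        if match:  # lines that are not throttle debug output are skipped
            delays[match["slot"]].append(int(match["delay"]))
    return dict(delays)


sample = [
    ">[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 2 | delay: 1019 ms (-980) | latency: 1930 ms | size: 17996 bytes",
    ">[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 2 | delay: 539 ms (-479) | latency: 2974 ms | size:367953 bytes",
    ">[scrapy.extensions.throttle] INFO: slot: api_slot | conc: 1 | delay: 2000 ms (+0) | latency: 503 ms | size: 57 bytes",
    ">[scrapy.extensions.throttle] INFO: slot: api_slot | conc: 1 | delay: 2000 ms (+0) | latency: 526 ms | size: 57 bytes",
]
print(delays_by_slot(sample))
```

On these four lines, site_slot's delay falls from 1019 ms to 539 ms while api_slot stays at 2000 ms, which is the behaviour described above.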
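Removing meta={'download_slot': "api_slot"} has no effect either.
But when I add a dummy request to the spider class, such as self.crawler.engine.crawl(Request("http://example.com/api/v1/item", meta={'download_slot': "api_slot"}), spider), the autothrottle logic for this download slot becomes active again for both requests (from the spider class and from the pipeline).

"But it would be better to separate the API calls into another download slot." Why? The AutoThrottle extension is designed to throttle each download slot separately. If you need something else, you need to write your own autothrottling extension (ideally subclassing the existing one to reuse as much code as possible).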
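The per-slot behaviour can be illustrated with a small standalone simulation. This is a sketch that does not import Scrapy; it follows the update rule as described in Scrapy's AutoThrottle documentation (target delay = latency / target concurrency, averaged with the previous delay, with non-200 responses not allowed to lower the delay). The `Slot` class, `adjust_delay`, and `simulate` are hypothetical names, and MIN_DELAY and MAX_DELAY assume Scrapy's default DOWNLOAD_DELAY and AUTOTHROTTLE_MAX_DELAY.

```python
from dataclasses import dataclass

START_DELAY = 2.0          # AUTOTHROTTLE_START_DELAY from the settings above
TARGET_CONCURRENCY = 50.0  # AUTOTHROTTLE_TARGET_CONCURRENCY from the settings above
MIN_DELAY = 0.0            # assumed DOWNLOAD_DELAY (Scrapy default)
MAX_DELAY = 60.0           # assumed AUTOTHROTTLE_MAX_DELAY (Scrapy default)


@dataclass
class Slot:
    """Stand-in for a downloader slot; each slot keeps its own delay."""
    delay: float = START_DELAY


def adjust_delay(slot, latency, status=200):
    """Apply one AutoThrottle-style update to slot.delay (seconds)."""
    target_delay = latency / TARGET_CONCURRENCY
    # Move halfway from the current delay towards the target delay.
    new_delay = (slot.delay + target_delay) / 2.0
    # Never drop below the target delay itself.
    new_delay = max(target_delay, new_delay)
    # Clamp to the configured bounds.
    new_delay = min(max(MIN_DELAY, new_delay), MAX_DELAY)
    # Non-200 responses may raise the delay but must not lower it.
    if status != 200 and new_delay <= slot.delay:
        return
    slot.delay = new_delay


def simulate(latencies, status=200):
    """Feed latencies (seconds) into a fresh slot; return delays in whole ms."""
    slot = Slot()
    delays_ms = []
    for latency in latencies:
        adjust_delay(slot, latency, status)
        delays_ms.append(int(slot.delay * 1000))
    return delays_ms


# First three site_slot latencies from the log above.
print(simulate([1.930, 2.974, 2.388]))
# Hypothetical case: the API always answers 201 instead of 200.
print(simulate([0.503, 0.526], status=201))
```

With the first three site_slot latencies from the log, this yields 1019, 539 and 293 ms, matching the logged delays. The second call shows that a slot whose responses are all non-200 would stay at the 2000 ms start delay under this rule. Whether the API in the question actually returns a non-200 status is an assumption the post does not confirm.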