Scrapy async API calls from a pipeline


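I'm trying to understand why each request sent from a pipeline behaves like an independent request and partially ignores AUTOTHROTTLE.

The idea is to send the collected items from the pipeline to a REST API via spider.crawler.engine.download.

Settings: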

AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_TARGET_CONCURRENCY = 50.0
AUTOTHROTTLE_DEBUG = True
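
As a rough reading of these settings: per the AutoThrottle documentation, each slot starts at AUTOTHROTTLE_START_DELAY and the delay is then steered toward latency / AUTOTHROTTLE_TARGET_CONCURRENCY. The sketch below only does that arithmetic; the function name `target_delay` is made up for illustration and is not part of Scrapy.

```python
# Values copied from the settings above.
AUTOTHROTTLE_START_DELAY = 2
AUTOTHROTTLE_TARGET_CONCURRENCY = 50.0


def target_delay(latency_s, target_concurrency=AUTOTHROTTLE_TARGET_CONCURRENCY):
    """Delay in seconds that AutoThrottle steers toward for a given latency."""
    return latency_s / target_concurrency


# Example latencies in seconds (illustrative values, not taken from a real run).
for latency in (0.5, 1.0, 2.0):
    print(f"latency {latency:.1f} s -> target delay {target_delay(latency) * 1000:.0f} ms")
```

With a target concurrency of 50, a slot whose responses take about half a second should settle near a delay of roughly 10 ms, far below the 2 s start delay.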
Suppose we have a spider that makes requests like the following (simplified code):

Finally, we want to send the items from the pipeline to the REST API:

from scrapy.http import JsonRequest

def process_item(self, item, spider):
    item = dict(item)
    request = JsonRequest("http://example.com/api/v1/item",
                          errback=self.errback_http,
                          method='POST',
                          meta={'download_slot': "site_slot"},
                          data=item)
    spider.crawler.engine.download(request, spider)
    return item
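Everything looks fine so far: pages are crawled and the REST API receives asynchronous requests (the web server's log also shows the POST requests). Part of the stdout: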

>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:10 | delay:   14 ms (-6) | latency:  404 ms | size: 14775 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 9 | delay:   13 ms (+0) | latency:  648 ms | size: 18353 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 8 | delay:   21 ms (+7) | latency: 1060 ms | size: 18030 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 7 | delay:   21 ms (+0) | latency: 1095 ms | size: 19249 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 7 | delay:   16 ms (-5) | latency:  508 ms | size: 16980 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 6 | delay:   25 ms (+9) | latency:  567 ms | size: 21663 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 6 | delay:   17 ms (-8) | latency:  447 ms | size: 14936 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 6 | delay:   13 ms (-3) | latency:  523 ms | size: 17130 bytes
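But it would be better to separate the API calls into a different download slot, so I changed the download slot in the pipeline request: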

from scrapy.http import JsonRequest

def process_item(self, item, spider):
    item = dict(item)
    request = JsonRequest("http://example.com/api/v1/item",
                          errback=self.errback_http,
                          method='POST',
                          meta={'download_slot': "api_slot"},
                          data=item)
    spider.crawler.engine.download(request, spider)
    return item
Here is my problem: now every API call behaves like an independent task and gets the initial AUTOTHROTTLE settings:

>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 2 | delay: 1019 ms (-980) | latency: 1930 ms | size: 17996 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 2 | delay:  539 ms (-479) | latency: 2974 ms | size:367953 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 4 | delay:  293 ms (-245) | latency: 2388 ms | size: 12509 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 4 | delay:  168 ms (-125) | latency: 2145 ms | size: 13688 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 5 | delay:  101 ms (-67) | latency: 1703 ms | size: 13180 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc: 9 | delay:   67 ms (-33) | latency: 1682 ms | size: 11926 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:11 | delay:   40 ms (-27) | latency:  635 ms | size: 15215 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:15 | delay:   37 ms (-2) | latency: 1769 ms | size: 12859 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:21 | delay:   35 ms (-2) | latency: 1657 ms | size: 12739 bytes
>[scrapy.extensions.throttle] INFO: slot: api_slot | conc: 1 | delay: 2000 ms (+0) | latency:  503 ms | size:    57 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:25 | delay:   24 ms (-10) | latency:  707 ms | size: 15382 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:30 | delay:   37 ms (+12) | latency: 1872 ms | size: 17532 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:30 | delay:   38 ms (+1) | latency: 1929 ms | size: 15600 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:30 | delay:   36 ms (-2) | latency: 1713 ms | size: 15435 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:31 | delay:   35 ms (+0) | latency: 1731 ms | size: 15198 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:33 | delay:   22 ms (-12) | latency:  502 ms | size: 15195 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:39 | delay:   28 ms (+5) | latency: 1402 ms | size: 15367 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:43 | delay:   37 ms (+9) | latency: 1875 ms | size: 15192 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:43 | delay:   23 ms (-13) | latency:  519 ms | size:366847 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:43 | delay:   35 ms (+11) | latency: 1789 ms | size: 15150 bytes
>[scrapy.extensions.throttle] INFO: slot: site_slot | conc:46 | delay:   35 ms (+0) | latency: 1752 ms | size: 15209 bytes
>[scrapy.extensions.throttle] INFO: slot: api_slot | conc: 1 | delay: 2000 ms (+0) | latency:  526 ms | size:    57 bytes
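Removing meta={'download_slot': "api_slot"} has no effect either.

But when I add a dummy request to the spider class, such as
self.crawler.engine.crawl(Request("http://example.com/api/v1/item", meta={'download_slot': "api_slot"}), spider)
the autothrottle logic for this download slot becomes active again for both requests (the one from the spider class and the one from the pipeline).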

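"But it would be better to separate the API calls into a different download slot." Why? The AutoThrottle extension is designed to throttle each download slot separately. If you need something different, you will need to write your own autothrottling extension (ideally by subclassing the existing one, to reuse as much code as possible).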
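
To make the per-slot behaviour concrete, here is a minimal pure-Python sketch of the delay update rules listed in the AutoThrottle documentation, with one independent state per download slot. It does not use Scrapy: `SlotState` and `adjust_delay` are hypothetical names, and MIN_DELAY and MAX_DELAY are assumed to be Scrapy's defaults because the question does not show them.

```python
from dataclasses import dataclass

START_DELAY = 2.0          # AUTOTHROTTLE_START_DELAY from the question
TARGET_CONCURRENCY = 50.0  # AUTOTHROTTLE_TARGET_CONCURRENCY from the question
MIN_DELAY = 0.0            # assumed DOWNLOAD_DELAY (not shown in the question)
MAX_DELAY = 60.0           # assumed AUTOTHROTTLE_MAX_DELAY (Scrapy's default)


@dataclass
class SlotState:
    """Hypothetical stand-in for the delay kept per download slot."""
    delay: float = START_DELAY


def adjust_delay(slot, latency, status=200):
    """Apply one update following the rules in the AutoThrottle docs."""
    target = latency / TARGET_CONCURRENCY
    # Average the old delay and the target, but never go below the target.
    new_delay = max(target, (slot.delay + target) / 2.0)
    # Keep the result between the configured bounds.
    new_delay = min(max(MIN_DELAY, new_delay), MAX_DELAY)
    # Non-200 responses are not allowed to decrease the delay.
    if status != 200 and new_delay <= slot.delay:
        return slot.delay
    slot.delay = new_delay
    return slot.delay


# One independent state per slot name.
slots = {"site_slot": SlotState(), "api_slot": SlotState()}

# Latencies (seconds) of the first two site_slot lines in the log above.
for latency in (1.930, 2.974):
    adjust_delay(slots["site_slot"], latency)

print(round(slots["site_slot"].delay * 1000))  # site_slot delay in ms
print(round(slots["api_slot"].delay * 1000))   # api_slot delay in ms, never updated
```

Replaying those two latencies gives the same 1019 ms and 539 ms steps as the site_slot log lines, while api_slot stays at its start delay because nothing feeds it. The `status` parameter is there only to illustrate the documented rule about non-200 responses. The question does not show which status the API returns, so whether that rule plays a part here is something to check rather than a conclusion.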