Scrapy Crawlera middleware order to enable httpcache


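I don't want to use the crawlera proxy service for pages that have already been cached by the httpcache middleware (because I have a monthly limit on the number of calls).

I am using the crawlera middleware and enabling it with: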

DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 610,
}

as recommended in the documentation.

However, after the crawl finishes, I get:

2017-04-23 00:14:24 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'crawlera/request': 11,
 'crawlera/request/method/GET': 11,
 'crawlera/response': 11,
 'crawlera/response/status/200': 10,
 'crawlera/response/status/301': 1,
 'downloader/request_bytes': 3324,
 'downloader/request_count': 11,
 'downloader/request_method_count/GET': 11,
 'downloader/response_bytes': 1352925,
 'downloader/response_count': 11,
 'downloader/response_status_count/200': 10,
 'downloader/response_status_count/301': 1,
 'dupefilter/filtered': 6,
 'finish_reason': 'closespider_pagecount',
 'finish_time': datetime.datetime(2017, 4, 22, 22, 14, 24, 839013),
 'httpcache/hit': 11,
 'log_count/DEBUG': 12,
 'log_count/INFO': 9,
 'request_depth_max': 1,
 'response_received_count': 10,
 'scheduler/dequeued': 10,
 'scheduler/dequeued/memory': 10,
 'scheduler/enqueued': 23,
 'scheduler/enqueued/memory': 23,
 'start_time': datetime.datetime(2017, 4, 22, 22, 14, 24, 317893)}
2017-04-23 00:14:24 [scrapy.core.engine] INFO: Spider closed (closespider_pagecount)

So I am not sure whether these calls went through the crawlera proxy service. When I changed the crawlera middleware order to 901, 749, or 751, I got the same result.

Does anyone know what is happening under the hood? Are the pages returned directly from the HTTP cache without calling the crawlera servers?

Thanks

Think of the number as a position relative to the other middlewares:

'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 600,
'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
'rotating_proxies.middlewares.BanDetectionMiddleware': 620

Just make sure that httpcache.HttpCacheMiddleware has a lower number than the proxy middleware.

This works fine for me.
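Applied to the setup in the question, the same idea might look like the sketch below. It swaps the rotating_proxies entries for the 'scrapy_crawlera.CrawleraMiddleware' entry from the question, and it assumes the cache is switched on with HTTPCACHE_ENABLED; neither detail comes from this answer, and the fragment is untested against a real Crawlera account.

```python
# settings.py (sketch, not a verified configuration)

# Assumption: the HTTP cache is enabled through this setting.
HTTPCACHE_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    # Lower number: process_request runs earlier, so a cached page
    # can be answered before the proxy middleware sees the request.
    'scrapy.downloadermiddlewares.httpcache.HttpCacheMiddleware': 600,
    # Higher number: only requests that missed the cache get this far.
    'scrapy_crawlera.CrawleraMiddleware': 610,
}
```

Listing HttpCacheMiddleware explicitly replaces its default position (900 in Scrapy's base settings), so it is not installed twice.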

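The final stats contain 'httpcache/hit': 11, so I believe the HTTP cache was used. In the log you should also see the 'cached' flag next to each downloaded page. As for 'crawlera/request/method/GET': 11, it means the scrapy-crawlera middleware processed 11 requests (it only sets the crawlera endpoint as the proxy for the request).

What do you see in the logs? Do you see the 'cached' flag for the URLs?

I do see 'cached', but how can I tell whether it skipped the crawlera call and returned the cached version? That depends on the middleware order. As I understand it, 'crawlera/request/method/GET' means it really did interact with the crawlera servers.

That is not the case: when a Request instance is processed, the stats counter is incremented and the 'proxy' key is updated. Nothing more. Have you tried disconnecting from the network? Scrapy has no log of the actual bytes sent over the network (although that could be a nice addition).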
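To make the point about counters concrete, here is a toy model in plain Python. It is not Scrapy's code: FakeCache, FakeProxy, download, the 'proxy/request' and 'network/calls' counters, and the proxy URL are all made up for illustration. It only mimics the documented rule that process_request hooks run in ascending order and that the first hook returning a response stops the chain before the download happens.

```python
from collections import Counter


class FakeCache:
    """Stand-in for a cache middleware: answers from a set of known URLs."""

    def __init__(self, stored_urls):
        self.stored_urls = stored_urls

    def process_request(self, request, stats):
        if request["url"] in self.stored_urls:
            stats["httpcache/hit"] += 1
            return {"url": request["url"], "flags": ["cached"]}
        return None  # cache miss: let the chain continue


class FakeProxy:
    """Stand-in for a proxy middleware: counts the request and tags it."""

    def process_request(self, request, stats):
        stats["proxy/request"] += 1
        # Hypothetical endpoint; nothing is sent anywhere at this point.
        request["meta"]["proxy"] = "http://proxy.example:8010"
        return None


def download(url, middlewares):
    """Run hooks in ascending order; stop at the first returned response."""
    stats = Counter()
    request = {"url": url, "meta": {}}
    for _order, mw in sorted(middlewares, key=lambda pair: pair[0]):
        response = mw.process_request(request, stats)
        if response is not None:
            return response, stats
    stats["network/calls"] += 1  # only reached when no hook answered
    return {"url": url, "flags": []}, stats


cache = FakeCache({"http://example.com/"})
proxy_first = [(610, FakeProxy()), (900, cache)]
cache_first = [(600, cache), (610, FakeProxy())]

resp_a, stats_a = download("http://example.com/", proxy_first)
resp_b, stats_b = download("http://example.com/", cache_first)
print(dict(stats_a))  # {'proxy/request': 1, 'httpcache/hit': 1}
print(dict(stats_b))  # {'httpcache/hit': 1}
```

In both orderings the cached URL never reaches the simulated network. With the proxy hook first, its counter still goes up, which matches the explanation above of why the crawlera counters can be non-zero while every page comes from the cache.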