Python Scrapy not working with Crawlera
I have been using Crawlera with Scrapy and it worked great. However, I changed the API key in the Crawlera dashboard and since then I haven't been able to get Crawlera working. I contacted their customer support and they said the API key is working fine. I decided to try getting Crawlera to work with the example from the Scrapy documentation. No luck. Scrapy is making requests to "dmoz.org" instead of paygo.com. I have scrapy-crawlera and scrapy installed. Here is the log:
[scrapy] INFO: Using crawlera at http://paygo.crawlera.com:8010?noconnect (user: [my_api_key])
2015-08-10 19:16:24 [scrapy] DEBUG: Telnet console listening on [my_ip_address]
2015-08-10 19:16:26 [scrapy] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2015-08-10 19:16:26 [scrapy] INFO: Closing spider (finished)
2015-08-10 19:16:26 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 660,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 16445,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 8, 11, 2, 16, 26, 990760),
'log_count/DEBUG': 3,
'log_count/INFO': 8,
'log_count/WARNING': 2,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2015, 8, 11, 2, 16, 24, 720987)}
2015-08-10 19:16:26 [scrapy] INFO: Spider closed (finished)
In your settings.py file, you need to configure DOWNLOADER_MIDDLEWARES. For example:
DUPEFILTER = True
COOKIES_ENABLED = False
RANDOMIZE_DOWNLOAD_DELAY = True
SCHEDULER_ORDER = 'BFO'

CRAWLERA_ENABLED = True
CRAWLERA_USER = 'user'      # your Crawlera API key
CRAWLERA_PASS = 'password'  # typically empty when authenticating with an API key

# Middleware priorities must be integers. CrawleraMiddleware sets the proxy
# itself, so HttpProxyMiddleware must not be given a proxy URL here.
DOWNLOADER_MIDDLEWARES = {
    'scrapy_crawlera.CrawleraMiddleware': 600,
}
"Scrapy is making requests to 'dmoz.org' instead of paygo" — why would you expect otherwise? scrapy-crawlera uses Crawlera as an HTTP proxy, so as far as Scrapy's logs are concerned the URL is still "dmoz.org". To confirm that Crawlera was actually used, print response.headers in your parse callback; you should see some X-Crawlera-* headers.
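To make that check concrete, here is a minimal sketch of how you might filter the proxy's headers out of response.headers in a parse callback (the helper function and spider usage are illustrative, not part of the original question):

```python
def crawlera_header_names(headers):
    """Return the names of headers added by Crawlera (X-Crawlera-*).

    `headers` is a mapping with bytes keys, like Scrapy's response.headers.
    An empty result means the response did not go through the proxy.
    """
    return [name for name in headers if name.startswith(b'X-Crawlera-')]


# Inside a Scrapy spider this could be used as:
#
#   def parse(self, response):
#       names = crawlera_header_names(response.headers)
#       self.logger.info('Crawlera headers: %s', names)
```

If the list comes back empty, the request bypassed Crawlera, which would point to a middleware or settings problem rather than an API-key problem.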