Warning: file_get_contents(/data/phpspider/zhask/data//catemap/2/python/287.json): failed to open stream: No such file or directory in /data/phpspider/zhask/libs/function.php on line 167

Warning: Invalid argument supplied for foreach() in /data/phpspider/zhask/libs/tag.function.php on line 1116

Notice: Undefined index: in /data/phpspider/zhask/libs/function.php on line 180

Warning: array_chunk() expects parameter 1 to be array, null given in /data/phpspider/zhask/libs/function.php on line 181
Python 某些站点上的刮擦超时_Python_Web Scraping_Scrapy - Fatal编程技术网

Python 某些站点上的刮擦超时

Python 某些站点上的刮擦超时,python,web-scraping,scrapy,Python,Web Scraping,Scrapy,在我自己的机器上我试过 > scrapy fetch http://google.com/ 或 工作完美,不知何故www.flyertalk.com与scrapy的关系并不好。我不断收到超时错误(180秒): 然而,卷发效果很好,没有打嗝 > curl -s http://www.flyertalk.com/ 很奇怪。以下是完整转储: 2015-11-20 17:35:07 [scrapy] INFO: Enabled extensions: CloseSpider, Tel

在我自己的机器上我试过

> scrapy fetch http://google.com/ 

工作完美,不知何故www.flyertalk.com与scrapy的关系并不好。我不断收到超时错误(180秒):

然而,卷发效果很好,没有打嗝

> curl -s http://www.flyertalk.com/ 
很奇怪。以下是完整转储:

2015-11-20 17:35:07 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-20 17:35:07 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-20 17:35:07 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-20 17:35:07 [scrapy] INFO: Enabled item pipelines: 
2015-11-20 17:35:07 [scrapy] INFO: Spider opened
2015-11-20 17:35:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:35:07 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6037
2015-11-20 17:36:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:37:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:38:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:38:07 [scrapy] DEBUG: Retrying <GET http://www.flyertalk.com> (failed 1 times): User timeout caused connection failure: Getting http://www.flyertalk.com took longer than 180.0 seconds..
2015-11-20 17:39:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:40:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:41:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:41:07 [scrapy] DEBUG: Retrying <GET http://www.flyertalk.com> (failed 2 times): User timeout caused connection failure: Getting http://www.flyertalk.com took longer than 180.0 seconds..
2015-11-20 17:35:07[scrapy]信息:启用的扩展:CloseSpider、TelnetConsole、LogStats、CoreStats、SpiderState
2015-11-20 17:35:07[剪贴]信息:启用的下载中间件:HttpAuthMiddleware,DownloadTimeoutMiddleware,UserAgentMiddleware,RetryMiddleware,DefaultHeadersMiddleware,MetaRefreshMiddleware,HttpCompressionMiddleware,RedirectMiddleware,Cookies Middleware,ChunkedTransferMiddleware,DownloadersStats
2015-11-20 17:35:07[剪贴]信息:启用的蜘蛛中间件:HttpErrorMiddleware、OffsiteMiddleware、RefererMiddleware、UrlLengthMiddleware、DepthMiddleware
2015-11-20 17:35:07[scrapy]信息:启用的项目管道:
2015-11-20 17:35:07[剪贴]信息:蜘蛛打开
2015-11-20 17:35:07[抓取]信息:抓取0页(0页/分钟),抓取0项(0项/分钟)
2015-11-20 17:35:07[scrapy]调试:Telnet控制台监听127.0.0.1:6037
2015-11-20 17:36:07[抓取]信息:抓取0页(0页/分钟),抓取0项(0项/分钟)
2015-11-20 17:37:07[抓取]信息:抓取0页(0页/分钟),抓取0项(0项/分钟)
2015-11-20 17:38:07[抓取]信息:抓取0页(0页/分钟),抓取0项(0项/分钟)
2015-11-20 17:38:07[scrapy]调试:重试(失败1次):用户超时导致连接失败:获取http://www.flyertalk.com 耗时超过180.0秒。。
2015-11-20 17:39:07[抓取]信息:抓取0页(0页/分钟),抓取0项(0项/分钟)
2015-11-20 17:40:07[抓取]信息:抓取0页(0页/分钟),抓取0项(0项/分钟)
2015-11-20 17:41:07[抓取]信息:抓取0页(0页/分钟),抓取0项(0项/分钟)
2015-11-20 17:41:07[scrapy]调试:重试(失败2次):用户超时导致连接失败:获取http://www.flyertalk.com 耗时超过180.0秒。。

我做了一些实验。
USER-AGENT
标题带来了所有不同:

$ scrapy shell http://www.flyertalk.com/ -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'
In [1]: response.xpath("//title/text()").extract_first().strip()
Out[1]: u"FlyerTalk - The world's most popular frequent flyer community - FlyerTalk is a living, growing community where frequent travelers around the world come to exchange knowledge and experiences about everything miles and points related."
如果不指定标题,我将看到它永远挂起

2015-11-20 17:35:07 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-11-20 17:35:07 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-11-20 17:35:07 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-11-20 17:35:07 [scrapy] INFO: Enabled item pipelines: 
2015-11-20 17:35:07 [scrapy] INFO: Spider opened
2015-11-20 17:35:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:35:07 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6037
2015-11-20 17:36:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:37:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:38:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:38:07 [scrapy] DEBUG: Retrying <GET http://www.flyertalk.com> (failed 1 times): User timeout caused connection failure: Getting http://www.flyertalk.com took longer than 180.0 seconds..
2015-11-20 17:39:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:40:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:41:07 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-11-20 17:41:07 [scrapy] DEBUG: Retrying <GET http://www.flyertalk.com> (failed 2 times): User timeout caused connection failure: Getting http://www.flyertalk.com took longer than 180.0 seconds..
$ scrapy shell http://www.flyertalk.com/ -s USER_AGENT='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.80 Safari/537.36'
In [1]: response.xpath("//title/text()").extract_first().strip()
Out[1]: u"FlyerTalk - The world's most popular frequent flyer community - FlyerTalk is a living, growing community where frequent travelers around the world come to exchange knowledge and experiences about everything miles and points related."