Twisted Python Failure - Scrapy Issues


I am trying to use Scrapy to scrape this website's search results for any search query -

The website uses AJAX (in the form of XHR) to display the search results. I managed to trace the XHR, and you'll notice it in my code below (inside the for loop, I store the URL into temp and increment 'i' in the loop):
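The spider would look roughly like this. This is only a sketch based on the description above: the class name, the page range, and the parse body are assumptions, since the original code is not shown here; only the XHR URL pattern and the 5-second DOWNLOAD_DELAY come from the question and the logs below.

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"                      # illustrative name
    allowed_domains = ["bewakoof.com"]
    download_delay = 5                   # matches DOWNLOAD_DELAY: 5 in the logs

    def start_requests(self):
        # Build the XHR URL for each results page: store it in temp
        # and increment i, as described above.
        for i in range(1, 4):  # the real page range is not shown
            temp = ("http://www.bewakoof.com/search/searchload/"
                    "search_text/shirt/page_num/%d" % i)
            yield scrapy.Request(temp, callback=self.parse)

    def parse(self, response):
        # The AJAX endpoint returns an HTML fragment of search results;
        # the extraction logic would go here.
        self.log("got %s" % response.url)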

Now, when I execute this, I get an unexpected error:

2015-07-09 11:46:01 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-09 11:46:01 [scrapy] INFO: Optional features available: ssl, http11
2015-07-09 11:46:01 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-09 11:46:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-09 11:46:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-09 11:46:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-09 11:46:02 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-09 11:46:02 [scrapy] INFO: Spider opened
2015-07-09 11:46:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-09 11:46:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-09 11:46:03 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:09 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] INFO: Closing spider (finished)
2015-07-09 11:46:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 9, 6, 16, 13, 793446),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 9, 6, 16, 2, 890066)}
2015-07-09 11:46:13 [scrapy] INFO: Spider closed (finished)
This is my updated output, as shown in the terminal:

2015-07-10 13:06:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-10 13:06:00 [scrapy] INFO: Optional features available: ssl, http11
2015-07-10 13:06:00 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-10 13:06:01 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-10 13:06:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-10 13:06:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-10 13:06:01 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-10 13:06:01 [scrapy] INFO: Spider opened
2015-07-10 13:06:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-10 13:06:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-10 13:06:02 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:08 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:13 [scrapy] INFO: Closing spider (finished)
2015-07-10 13:06:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 10, 7, 36, 13, 11023),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 10, 7, 36, 1, 114912)}
2015-07-10 13:06:13 [scrapy] INFO: Spider closed (finished)
So, as you can see, the error is still the same! :( So please help me solve this.

UPDATE:

This is the output when I tried to catch the exception, as @JoeLinux suggested:

>>> try:
...     fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
...     e
... 
2015-07-10 17:51:13 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:14 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:15 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
ResponseFailed([<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>],)
>>> print e.reasons[0].getTraceback()
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 214, in doRead
    return self._dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 220, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 114, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1523, in dataReceived
    self._parser.dataReceived(bytes)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
    why = self.lineReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 271, in lineReceived
    self.statusReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 409, in statusReceived
    raise ParseError("wrong number of parts", status)
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')
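That last line is the root cause: the server answers with the status line 'HTTP/1.1 500', a status code with no reason phrase. Twisted's HTTPClientParser.statusReceived() (the last frame in the traceback) splits the status line and insists on exactly three parts (version, code, and phrase), so the two-part line raises ParseError. Effectively:

from twisted.web._newclient import ParseError

# What the parser in the traceback above effectively does:
status = 'HTTP/1.1 500'        # status code, but no reason phrase
parts = status.split(' ', 2)   # ['HTTP/1.1', '500'], only two parts
if len(parts) != 3:            # version, code AND phrase are required
    raise ParseError('wrong number of parts', status)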

I was able to reproduce your situation in the scrapy shell. Here's the error I got in the interactive shell:

$ scrapy shell 
...
>>> try:
>>>    fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
>>> except Exception as e:
>>>    e
2015-07-09 13:53:37-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
>>> print e.reasons[0].getTraceback()
...
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')
Note that where I put "...", there are a few lines of text that aren't as important. That last line shows "wrong number of parts". After some googling, I found this issue:

The best suggestion there is a monkey patch. Read through the thread and give it a try.

I was getting the same error:

[<twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 302')>]
Now it works.

I think you can try this:

  • In the method _monkey_patching_HTTPClientParser_statusReceived, change
    from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
    to
    from twisted.web._newclient import HTTPClientParser, ParseError

  • In the method start_requests, call
    _monkey_patching_HTTPClientParser_statusReceived
    (a sketch of the complete patch follows below)
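For reference, here is a sketch of what that monkey patch might look like, reconstructed from the two steps above (the spider name and start_urls are placeholders; treat this as illustrative rather than the exact code from the thread). The wrapper appends a dummy reason phrase when the status line has too few parts, instead of letting ParseError propagate:

import scrapy

def _monkey_patching_HTTPClientParser_statusReceived():
    # Per the first step above: import from twisted.web._newclient,
    # not scrapy.xlib.tx._newclient.
    from twisted.web._newclient import HTTPClientParser, ParseError

    old_sr = HTTPClientParser.statusReceived

    def statusReceived(self, status):
        # Re-parse a malformed status line such as 'HTTP/1.1 500'
        # as 'HTTP/1.1 500 OK' instead of raising
        # ParseError('wrong number of parts', ...).
        try:
            return old_sr(self, status)
        except ParseError as e:
            if e.args[0] == 'wrong number of parts':
                return old_sr(self, status + ' OK')
            raise

    statusReceived.__doc__ = old_sr.__doc__
    HTTPClientParser.statusReceived = statusReceived

class MySpider(scrapy.Spider):  # placeholder spider
    name = "search"
    start_urls = [
        "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1",
    ]

    def start_requests(self):
        # Per the second step above: apply the patch before the
        # first request goes out.
        _monkey_patching_HTTPClientParser_statusReceived()
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)

    def parse(self, response):
        pass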