Twisted Python Failure - Scrapy Issues


I am trying to use Scrapy to scrape this website's search results for any search query -

The website uses AJAX (in the form of XHR) to display the search results. I managed to trace the XHR, and you'll notice it in my code below (inside the for loop, I store the URL into temp and increment 'i' in the loop):
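The spider would look roughly like this. This is only a sketch based on the description above: the class name, the page range, and the parse body are assumptions, since the original code is not shown here; only the XHR URL pattern and the 5-second DOWNLOAD_DELAY come from the question and the logs below.

import scrapy

class SearchSpider(scrapy.Spider):
    name = "search"                      # illustrative name
    allowed_domains = ["bewakoof.com"]
    download_delay = 5                   # matches DOWNLOAD_DELAY: 5 in the logs

    def start_requests(self):
        # Build the XHR URL for each results page: store it in temp
        # and increment i, as described above.
        for i in range(1, 4):  # the real page range is not shown
            temp = ("http://www.bewakoof.com/search/searchload/"
                    "search_text/shirt/page_num/%d" % i)
            yield scrapy.Request(temp, callback=self.parse)

    def parse(self, response):
        # The AJAX endpoint returns an HTML fragment of search results;
        # the extraction logic would go here.
        self.log("got %s" % response.url)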

Now, when I execute this, I get an unexpected error:

2015-07-09 11:46:01 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-09 11:46:01 [scrapy] INFO: Optional features available: ssl, http11
2015-07-09 11:46:01 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-09 11:46:02 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-09 11:46:02 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-09 11:46:02 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-09 11:46:02 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-09 11:46:02 [scrapy] INFO: Spider opened
2015-07-09 11:46:02 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-09 11:46:02 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-09 11:46:03 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:09 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 11:46:13 [scrapy] INFO: Closing spider (finished)
2015-07-09 11:46:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 9, 6, 16, 13, 793446),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 9, 6, 16, 2, 890066)}
2015-07-09 11:46:13 [scrapy] INFO: Spider closed (finished)
This is my updated output, as shown in the terminal:

2015-07-10 13:06:00 [scrapy] INFO: Scrapy 1.0.0 started (bot: scrapybot)
2015-07-10 13:06:00 [scrapy] INFO: Optional features available: ssl, http11
2015-07-10 13:06:00 [scrapy] INFO: Overridden settings: {'DOWNLOAD_DELAY': 5}
2015-07-10 13:06:01 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-10 13:06:01 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-10 13:06:01 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-10 13:06:01 [scrapy] INFO: Enabled item pipelines: 
hi
2015-07-10 13:06:01 [scrapy] INFO: Spider opened
2015-07-10 13:06:01 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-10 13:06:01 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-10 13:06:02 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:08 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:12 [scrapy] ERROR: Error downloading <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1>: [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 13:06:13 [scrapy] INFO: Closing spider (finished)
2015-07-10 13:06:13 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
 'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 3,
 'downloader/request_bytes': 780,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2015, 7, 10, 7, 36, 13, 11023),
 'log_count/DEBUG': 4,
 'log_count/ERROR': 1,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2015, 7, 10, 7, 36, 1, 114912)}
2015-07-10 13:06:13 [scrapy] INFO: Spider closed (finished)
So, as you can see, the error is still the same! :( So please help me solve this.

UPDATE:

This is the output when I tried to catch the exception, as @JoeLinux suggested:

>>> try:
...     fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
... except Exception as e:
...     e
... 
2015-07-10 17:51:13 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:14 [scrapy] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-10 17:51:15 [scrapy] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
ResponseFailed([<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>],)
>>> print e.reasons[0].getTraceback()
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
    why = selectable.doRead()
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 214, in doRead
    return self._dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 220, in _dataReceived
    rval = self.protocol.dataReceived(data)
  File "/usr/lib/python2.7/dist-packages/twisted/internet/endpoints.py", line 114, in dataReceived
    return self._wrappedProtocol.dataReceived(data)
--- <exception caught here> ---
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 1523, in dataReceived
    self._parser.dataReceived(bytes)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 382, in dataReceived
    HTTPParser.dataReceived(self, data)
  File "/usr/lib/python2.7/dist-packages/twisted/protocols/basic.py", line 571, in dataReceived
    why = self.lineReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 271, in lineReceived
    self.statusReceived(line)
  File "/usr/lib/python2.7/dist-packages/twisted/web/_newclient.py", line 409, in statusReceived
    raise ParseError("wrong number of parts", status)
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')
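That last line is the root cause: the server answers with the status line 'HTTP/1.1 500', a status code with no reason phrase. Twisted's HTTPClientParser.statusReceived() (the last frame in the traceback) splits the status line and insists on exactly three parts (version, code, and phrase), so the two-part line raises ParseError. Effectively:

from twisted.web._newclient import ParseError

# What the parser in the traceback above effectively does:
status = 'HTTP/1.1 500'        # status code, but no reason phrase
parts = status.split(' ', 2)   # ['HTTP/1.1', '500'], only two parts
if len(parts) != 3:            # version, code AND phrase are required
    raise ParseError('wrong number of parts', status)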

I was able to reproduce your situation in the scrapy shell. Here's the error I got in the interactive shell:

$ scrapy shell 
...
>>> try:
>>>    fetch("http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1")
>>> except Exception as e:
>>>    e
2015-07-09 13:53:37-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 2 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
2015-07-09 13:53:38-0400 [default] DEBUG: Gave up retrying <GET http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1> (failed 3 times): [<twisted.python.failure.Failure <class 'twisted.web._newclient.ParseError'>>]
>>> print e.reasons[0].getTraceback()
...
twisted.web._newclient.ParseError: ('wrong number of parts', 'HTTP/1.1 500')
Note that where I put "...", there are a few lines of text that aren't as important. That last line shows "wrong number of parts". After some googling, I found this issue:

The best suggestion there is a monkey patch. Read through the thread and give it a try.

I was getting the same error:

[<twisted.python.failure.Failure twisted.web._newclient.ParseError: (u'wrong number of parts', 'HTTP/1.1 302')>]
Now it works.

I think you can try this:

  • In the method _monkey_patching_HTTPClientParser_statusReceived, change
    from scrapy.xlib.tx._newclient import HTTPClientParser, ParseError
    to
    from twisted.web._newclient import HTTPClientParser, ParseError

  • In the method start_requests, call
    _monkey_patching_HTTPClientParser_statusReceived
    (a sketch of the complete patch follows below)
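For reference, here is a sketch of what that monkey patch might look like, reconstructed from the two steps above (the spider name and start_urls are placeholders; treat this as illustrative rather than the exact code from the thread). The wrapper appends a dummy reason phrase when the status line has too few parts, instead of letting ParseError propagate:

import scrapy

def _monkey_patching_HTTPClientParser_statusReceived():
    # Per the first step above: import from twisted.web._newclient,
    # not scrapy.xlib.tx._newclient.
    from twisted.web._newclient import HTTPClientParser, ParseError

    old_sr = HTTPClientParser.statusReceived

    def statusReceived(self, status):
        # Re-parse a malformed status line such as 'HTTP/1.1 500'
        # as 'HTTP/1.1 500 OK' instead of raising
        # ParseError('wrong number of parts', ...).
        try:
            return old_sr(self, status)
        except ParseError as e:
            if e.args[0] == 'wrong number of parts':
                return old_sr(self, status + ' OK')
            raise

    statusReceived.__doc__ = old_sr.__doc__
    HTTPClientParser.statusReceived = statusReceived

class MySpider(scrapy.Spider):  # placeholder spider
    name = "search"
    start_urls = [
        "http://www.bewakoof.com/search/searchload/search_text/shirt/page_num/1",
    ]

    def start_requests(self):
        # Per the second step above: apply the patch before the
        # first request goes out.
        _monkey_patching_HTTPClientParser_statusReceived()
        for url in self.start_urls:
            yield scrapy.Request(url, dont_filter=True)

    def parse(self, response):
        pass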