Python Spider closes after getting several failed URLs


python scrapy


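Recently I had to crawl a huge list of URLs, many of which fail to load, take too long to load, do not exist, and so on.

When my spider hits a series of such broken URLs, it shuts down automatically. How can I change this behavior so that it does not sweat over failed URLs and simply skips them?

Here is my error traceback: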

Error during info_callback
Traceback (most recent call last):
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 415, in dataReceived
    self._write(bytes)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/protocols/tls.py", line 554, in _write
    sent = self._tlsConnection.send(toSend)
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 949, in send
    result = _lib.SSL_write(self._ssl, buf, len(buf))
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 702, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
--- <exception caught here> ---
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1055, in infoCallback
    return wrapped(connection, where, ret)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1157, in _identityVerifyingInfoCallback
    transport = connection.get_app_data()
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1224, in get_app_data
    return self._app_data
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 838, in __getattr__
    return getattr(self._socket, name)
exceptions.AttributeError: 'NoneType' object has no attribute '_app_data'

From callback <function infoCallback at 0x7feaa9e3a8c0>:
Traceback (most recent call last):
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 702, in wrapper
    callback(Connection._reverse_mapping[ssl], where, return_code)
  File "/home/radar/anaconda/lib/python2.7/site-packages/twisted/internet/_sslverify.py", line 1059, in infoCallback
    connection.get_app_data().failVerification(f)
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 1224, in get_app_data
    return self._app_data
  File "/home/radar/anaconda/lib/python2.7/site-packages/OpenSSL/SSL.py", line 838, in __getattr__
    return getattr(self._socket, name)
AttributeError: 'NoneType' object has no attribute '_app_data'

What is the error? Why does my spider close on these URLs? How can I change this?

The first error is caused by a bug in Scrapy.

It can be resolved by installing service_identity:
pip install service_identity

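The second problem is that Twisted could not connect to the example domain. There is nothing to do in that case, because the URL is skipped without any problem; the log only records that nothing answered on the other end. I don't think this is what closes your spider. The shutdown is more likely a consequence of the first error above.

Comments:

The spider closes after I run into a dozen or so of these failures. Here are my stats. I fed in thousands of URLs from an Excel file.

I'm not sure, but if your thousands of URLs belong to the sites that raise DNSLookupError, Scrapy may know that the given domain is unavailable and not even attempt to crawl those URLs. Again, I'm not sure, but I suspect that is what is happening.

No, that's not the case. All my URLs are from different domains. Some other factor is at play here, and I'd like to know what it is.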
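For a crawl over many dead or slow URLs, a few Scrapy settings control how long each failure takes and whether errors can stop the crawl. The following is only a sketch of a settings.py fragment. The numeric values are illustrative assumptions, not taken from the question, and nothing in this thread establishes that any of these settings caused the shutdown. DNS_TIMEOUT may not be available in older Scrapy releases.

```python
# settings.py (sketch; the values below are illustrative assumptions)

# Give up on slow hosts sooner than Scrapy's default of 180 seconds.
DOWNLOAD_TIMEOUT = 15

# Fail DNS lookups for dead domains sooner than the default of 60 seconds.
DNS_TIMEOUT = 10

# Retry each failed request at most once instead of the default two retries.
RETRY_ENABLED = True
RETRY_TIMES = 1

# 0 (the default) means the CloseSpider extension never closes the spider
# because of the number of errors received.
CLOSESPIDER_ERRORCOUNT = 0
```

If CLOSESPIDER_ERRORCOUNT is set to a non-zero value somewhere in the project, the spider is closed once that many errors have been received, so it is worth checking. To log each failed URL explicitly and move on, an errback function can also be passed to each scrapy.Request.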