Python urllib2: how to avoid errors - need help

python http web-crawler web-scraping

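I am using Python urllib2 to download pages from the web. I am not setting any kind of user agent or similar. I get the sample errors below. Can someone tell me a simple way to avoid them?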

http://www.rottentomatoes.com/m/foxy_brown/
The server couldn't fulfill the request.
Error code:  403


http://www.spiritus-temporis.com/marc-platt-dancer-/
The server couldn't fulfill the request.
Error code:  503

http://www.golf-equipment-guide.com/news/Mark-Nichols-(golfer).html!!
The server couldn't fulfill the request.
Error code:  500


http://www.ehx.com/blog/mike-matthews-in-fuzz-documentary!!
We failed to reach a server.
Reason:  timed out
IncompleteRead(5621 bytes read)
Traceback (most recent call last):
    File "download.py", line 43, in <module>
    localFile.write(response.read())
    File "/usr/lib/python2.6/socket.py", line 327, in read
    data = self._sock.recv(rbufsize)
    File "/usr/lib/python2.6/httplib.py", line 517, in read
    return self._read_chunked(amt)
    File "/usr/lib/python2.6/httplib.py", line 563, in _read_chunked
    raise IncompleteRead(value)
IncompleteRead: IncompleteRead(5621 bytes read)

Thank you

Bala

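Many web resources require some kind of cookie or other authentication before they can be accessed; your 403 status code is most likely a result of that.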
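As a rough illustration of the cookie point, here is a minimal sketch that keeps cookies between requests and also sends an explicit User-Agent header. The User-Agent part is an assumption on my side: the question says no user agent is set, and some sites refuse such requests, but I have not verified that this is what these particular sites do. The names `USER_AGENT`, `build_opener_with_cookies` and `make_request` are made up for this example, and the `ImportError` fallback is only there so the same code also runs on Python 3.

```python
try:  # Python 2, as used in the question
    import urllib2
    import cookielib
except ImportError:  # Python 3 fallback (assumption, for readers on newer versions)
    import urllib.request as urllib2
    import http.cookiejar as cookielib

# Hypothetical User-Agent string; replace it with one that identifies your crawler.
USER_AGENT = "Mozilla/5.0 (compatible; example-crawler/0.1)"


def build_opener_with_cookies():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
    return opener, jar


def make_request(url):
    """Build a Request that carries an explicit User-Agent header."""
    return urllib2.Request(url, headers={"User-Agent": USER_AGENT})
```

You would then call something like `opener.open(make_request(url), timeout=30).read()`, reusing the same opener for every page so that cookies set by one response are sent with the next request. This will not help with sites that require a real login.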

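A 503 error tends to mean that you are hitting resources on the server rapidly in a loop, and that you need to wait briefly before attempting another access.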
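The "wait briefly, then try again" advice could be sketched roughly like this. The function name `fetch_with_retry`, the number of retries and the doubling delay are all assumptions chosen for illustration, not values the server asks for; the `opener` and `sleep` parameters exist only so the function can be exercised without a network.

```python
import time

try:  # Python 2, as used in the question
    import urllib2
except ImportError:  # Python 3 fallback (assumption); HTTPError is available here too
    import urllib.request as urllib2


def fetch_with_retry(url, opener=urllib2.urlopen, retries=3, delay=2.0,
                     sleep=time.sleep):
    """Fetch url, pausing and retrying when the server answers 503."""
    for attempt in range(retries + 1):
        try:
            return opener(url).read()
        except urllib2.HTTPError as e:
            # Only 503 is treated as temporary; other codes are re-raised,
            # and so is a 503 once the retries are used up.
            if e.code != 503 or attempt == retries:
                raise
            # Wait longer after each failure: delay, 2 * delay, 4 * delay, ...
            sleep(delay * (2 ** attempt))
```

Even when nothing fails, putting a short `time.sleep` between consecutive downloads from the same host is usually the politer option for a crawler.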

The URL in the 500 example does not appear to exist at all.

The URL behind the timeout error probably does not need the trailing "!!"; I could only load the resource without it.
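For the timeout and the `IncompleteRead` traceback in the question, something along these lines might help. It is only a sketch: `clean_url` and `download` are hypothetical names, the 30 second timeout is arbitrary, and stripping the trailing "!" characters assumes they are not part of the real address.

```python
import socket

try:  # Python 2, as used in the question
    import urllib2
    import httplib
except ImportError:  # Python 3 fallback (assumption)
    import urllib.request as urllib2
    import http.client as httplib


def clean_url(url):
    """Drop the trailing '!' characters seen on two URLs in the question."""
    return url.rstrip("!")


def download(url, opener=urllib2.urlopen, timeout=30):
    """Return (body, complete); complete is False if the read was cut short."""
    try:
        response = opener(clean_url(url), timeout=timeout)
    except urllib2.HTTPError:
        raise  # status-code errors (403, 500, 503) are a separate problem
    except (urllib2.URLError, socket.timeout) as e:
        print("We failed to reach a server: %s" % e)
        return None, False
    try:
        return response.read(), True
    except httplib.IncompleteRead as e:
        # e.partial holds the bytes received before the connection broke
        return e.partial, False
```

Whether a partial body is worth writing to disk depends on what you do with the pages afterwards; the `complete` flag at least lets the caller decide instead of crashing on line 43 of `download.py`.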

I recommend reading up on HTTP status codes.
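As a starting point for reading the codes, the standard library already maps each number to its reason phrase. The helper below is a small sketch; `describe_status` and the wording in `CATEGORIES` are my own, not part of any library.

```python
try:  # Python 2, as used in the question
    import httplib
except ImportError:  # Python 3 fallback (assumption)
    import http.client as httplib

# Rough meaning of the two error classes (wording is this example's own).
CATEGORIES = {
    4: "client error: the request was refused or was wrong",
    5: "server error: the server failed while handling the request",
}


def describe_status(code):
    """Return a one-line description such as '403 Forbidden (client error: ...)'."""
    reason = httplib.responses.get(code, "Unknown")
    category = CATEGORIES.get(code // 100, "not an error")
    return "%d %s (%s)" % (code, reason, category)
```

Printing `describe_status(e.code)` in the `HTTPError` handler, instead of the bare number, makes it easier to tell a 4xx (something about your request) from a 5xx (something on the server side).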

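For more complex tasks, you may want to consider mechanize, twill, or even Selenium or Windmill, which support more complex scenarios, including cookies and JavaScript. For a random website, working around these issues with urllib2 alone can be tricky (signed cookies, anyone?).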