How do I fix a Python "Traceback (most recent call last)" error?

I wrote a program that extracts all links to PDF files from a web page. It runs without errors on some sites, for example:

Hussam# python extractPDF.py http://www.cs.odu.edu/~mln/teaching/cs532-s17/test/pdfs.html
Output:

Entered URL:
http://www.cs.odu.edu/~mln/teaching/cs532-s17/test/pdfs.html
Final URL:
http://www.cs.odu.edu/~mln/teaching/cs532-s17/test/pdfs.html
http://www.cs.odu.edu/~mln/pubs/ht-2015/hypertext-2015-temporal-violations.pdf
Size: 2184076
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-annotations.pdf
Size: 622981
http://arxiv.org/pdf/1512.06195
Size: 1748961
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-off-topic.pdf
Size: 4308768
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-stories.pdf
Size: 1274604
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-profiling.pdf
Size: 639001
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-brunelle-damage.pdf
Size: 2205546
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-mink.pdf
Size: 1254605
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-arabic-sites.pdf
Size: 709420
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-dictionary.pdf
Size: 2350603
On the other hand, if I try this link:

Hussam# python extractPDF.py http://www.cs.odu.edu/~mln/pubs/all.html
I get the correct output, but with an error at the end:

Entered URL:
http://www.cs.odu.edu/~mln/pubs/all.html
Final URL:
http://www.cs.odu.edu/~mln/pubs/all.html
http://www.cs.odu.edu/~mln/pubs/tpdl-2016/tpdl-2016-kelly.pdf
Size: 953454
http://www.cs.odu.edu/~mln/pubs/tpdl-2016/tpdl-2016-alam.pdf
Size: 928749
http://www.cs.odu.edu/~mln/pubs/jcdl-2016/jcdl-2016-alam-ipfs.pdf
Size: 516538
http://www.cs.odu.edu/~mln/pubs/jcdl-2016/jcdl-2016-alam-memgator.pdf
Size: 345028
http://www.cs.odu.edu/~mln/pubs/jcdl-2016/jcdl-2016-nwala.pdf
Size: 640173
http://www.cs.odu.edu/~mln/pubs/ht-2015/hypertext-2015-temporal-violations.pdf
Size: 2184076
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-annotations.pdf
Size: 622981
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-off-topic.pdf
Size: 4308768
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-stories.pdf
Size: 1274604
http://www.cs.odu.edu/~mln/pubs/tpdl-2015/tpdl-2015-profiling.pdf
Size: 639001
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-temporal-intention.pdf
Size: 720476
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-mink.pdf
Size: 1254605
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-arabic-sites.pdf
Size: 709420
http://www.cs.odu.edu/~mln/pubs/jcdl-2015/jcdl-2015-dictionary.pdf
Size: 2350603
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-kelly-acid.pdf
Size: 541843
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-kelly-mink.pdf
Size: 556863
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-brunelle-damage.pdf
Size: 2205546
http://www.cs.odu.edu/~mln/pubs/jcdl-2014/jcdl-2014-cartledge-copies.pdf
Size: 1199511
http://www.cs.odu.edu/~mln/pubs/sigcse-2014/web-science-sigcse-2014.pdf
Size: 158242
http://www.cs.odu.edu/~mln/pubs/ecir-2014/ecir-2014.pdf
Size: 902825
http://www.cs.odu.edu/~mln/pubs/ieee-vis-2013/2013-ieee-vis-boxoffice.pdf
Size: 122738
Traceback (most recent call last):
  File "extractPDF.py", line 21, in <module>
    r = urllib2.urlopen(link)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 429, in error
    result = self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 605, in http_error_302
    return self.parent.open(new, timeout=req.timeout)
  File "/usr/lib/python2.7/urllib2.py", line 397, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 510, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 435, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 518, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 403: Forbidden
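Your code fetches every link on the page. At least one of those links (not necessarily one that points to a PDF) is not available to you. HTTP 403 means "the server understood the request but refuses to authorize it". The URL may require credentials that you do not have.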

urllib2 raises exceptions for error conditions, and your code needs to handle at least some of them.
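
To make the exception concrete, here is a small sketch that builds an HTTPError by hand, so no network access is needed, and inspects it. The example.com URL is a made-up placeholder, and the import fallback is only there so the snippet also runs on Python 3.

```python
try:  # Python 2.7, as used in the question
    from urllib2 import HTTPError, URLError
except ImportError:  # Python 3 keeps the same classes in urllib.error
    from urllib.error import HTTPError, URLError

# Hypothetical URL, used only to construct the exception object.
err = HTTPError("http://example.com/private.pdf", 403, "Forbidden", None, None)

# HTTPError is a subclass of URLError, so it must be caught first
# when both appear in the same try statement.
is_url_error = isinstance(err, URLError)

# The status code is an integer, not a string.
status = err.code

# str() gives the same text that ends the traceback.
summary = str(err)

print(summary)
```

Because err.code is an integer, it has to be converted with str() or a format string before being joined to other text.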

If you just want the code to keep running instead of dying, replace the relevant part with:

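    for link in links:
        r = None
        try:
            r = urllib2.urlopen(link)
        except urllib2.HTTPError as e:
            # Report the failing link and move on to the next one
            print link
            # e.code is an int, so convert it before concatenating;
            # e.msg holds the reason phrase, e.g. "Forbidden"
            print "Error: " + str(e.code) + " " + e.msg
            continue

        # Only links that were fetched successfully reach this point
        if r.headers['content-type'] == "application/pdf":
            print link
            print "Size: " + r.headers['Content-Length']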
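
A slightly more defensive variant of the same idea, offered as a sketch and not as the original extractPDF.py: it also catches URLError (raised when no response arrives at all), tolerates a Content-Type that carries parameters, and copes with a missing Content-Length. The function name check_link and its opener parameter are hypothetical; opener exists only so the function can be exercised without network access.

```python
try:  # Python 2.7, as used in the question
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:  # Python 3 equivalents
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def check_link(link, opener=urlopen):
    """Return a one-line report for link instead of raising on failure."""
    try:
        r = opener(link)
    except HTTPError as e:
        # The server answered, but with an error status such as 403 or 404.
        return "%s -> HTTP error %d %s" % (link, e.code, e.msg)
    except URLError as e:
        # No usable answer at all (DNS failure, refused connection, ...).
        return "%s -> unreachable: %s" % (link, e.reason)

    # Content-Type may carry parameters ("application/pdf; charset=..."),
    # and Content-Length is optional, so read both defensively.
    raw_type = r.headers.get("Content-Type") or ""
    content_type = raw_type.split(";")[0].strip().lower()
    if content_type != "application/pdf":
        return "%s -> not a PDF" % link
    size = r.headers.get("Content-Length", "unknown")
    return "%s -> PDF, size %s" % (link, size)
```

Calling print(check_link(link)) for each entry in links then prints one line per link, and a 403 on one link no longer stops the loop.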

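403 Forbidden. You are not allowed to access whatever URL is in req.get_full_url().

What don't you understand? As @WayneWerner pointed out, the problem is with the URL you are accessing.

Try handling the HTTPError exception.

Is your crawler being blocked because it was detected?

OK, thanks. My crawler was not blocked; I had just forgotten to write an exception handler for HTTPError.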