Python web crawler: connection timed out

I am trying to implement a simple web crawler and have written some simple code to start with. There are two modules, fetcher.py and crawler.py. Here are the files:

fetcher.py:

    import urllib2
    import re

    def fetcher(s):
        "fetch a web page from a url"
        try:
            req = urllib2.Request(s)
            urlResponse = urllib2.urlopen(req).read()
        except urllib2.URLError as e:
            # On a failed request (e.g. a connection timeout) only the reason
            # is printed and the function falls through, returning None.
            print e.reason
            return

        # Save the page to a file named after the host part of the URL and
        # return the open file object, rewound to the beginning.
        p, q = s.split("//")
        d = q.split("/")
        fdes = open(d[0], "w+")
        fdes.write(str(urlResponse))
        fdes.seek(0)
        return fdes


    if __name__ == "__main__":
        defaultSeed = "http://www.python.org"
        print fetcher(defaultSeed)
crawler.py:

from bs4 import BeautifulSoup
import re
from fetchpage import fetcher    

usedLinks = open("Used","a+")
newLinks = open("New","w+")

newLinks.seek(0)

def parse(fd,var=0):
        soup = BeautifulSoup(fd)
        for li in soup.find_all("a",href=re.compile("http")):
                newLinks.seek(0,2)
                newLinks.write(str(li.get("href")).strip("/"))
                newLinks.write("\n")

        fd.close()
        newLinks.seek(var)
        link = newLinks.readline().strip("\n")

        return str(link)


def crawler(seed,n):
        if n == 0:
                usedLinks.close()
                newLinks.close()
                return
        else:
                usedLinks.write(seed)
                usedLinks.write("\n")
                fdes = fetcher(seed)
                newSeed = parse(fdes,newLinks.tell())
                crawler(newSeed,n-1)

if __name__ == "__main__":
        crawler("http://www.python.org/",7)
The problem is that when I run crawler.py, it works fine for the first 4-5 links, then it hangs, and after a minute I get the following error:

[Errno 110] Connection timed out
Traceback (most recent call last):
  File "crawler.py", line 37, in <module>
    crawler("http://www.python.org/",7)
  File "crawler.py", line 34, in crawler
    crawler(newSeed,n-1)
  File "crawler.py", line 34, in crawler
    crawler(newSeed,n-1)
  File "crawler.py", line 34, in crawler
    crawler(newSeed,n-1)
  File "crawler.py", line 34, in crawler
    crawler(newSeed,n-1)
  File "crawler.py", line 34, in crawler
    crawler(newSeed,n-1)
  File "crawler.py", line 33, in crawler
    newSeed = parse(fdes,newLinks.tell())
  File "crawler.py", line 11, in parse
    soup = BeautifulSoup(fd)
  File "/usr/lib/python2.7/dist-packages/bs4/__init__.py", line 169, in __init__
    self.builder.prepare_markup(markup, from_encoding))
  File "/usr/lib/python2.7/dist-packages/bs4/builder/_lxml.py", line 68, in prepare_markup
    dammit = UnicodeDammit(markup, try_encodings, is_html=True)
  File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 191, in __init__
    self._detectEncoding(markup, is_html)
  File "/usr/lib/python2.7/dist-packages/bs4/dammit.py", line 362, in _detectEncoding
    xml_encoding_match = xml_encoding_re.match(xml_data)
TypeError: expected string or buffer
Can someone help me with this? I am very new to Python, and I can't figure out why it says the connection timed out after some time.

A connection timeout is not specific to Python; it just means that you made a request to the server and the server did not respond within the amount of time your application was willing to wait.
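
For reference, urllib2.urlopen() accepts a timeout argument (in seconds), which lets the fetcher give up and report the problem itself instead of hanging until the operating system's default timeout kicks in. Below is a minimal sketch of that idea; the helper name fetch_with_timeout and the 10-second value are only illustrative, not part of the original code:

    import socket
    import urllib2

    def fetch_with_timeout(url, timeout=10):
        "Fetch a page, giving up after `timeout` seconds instead of hanging."
        try:
            # urlopen takes a timeout keyword argument, measured in seconds.
            return urllib2.urlopen(urllib2.Request(url), timeout=timeout).read()
        except socket.timeout:
            print "timed out after", timeout, "seconds:", url
        except urllib2.URLError as e:
            # A timeout can also surface here, with e.reason set to a
            # socket.timeout instance.
            print e.reason
        return None

    if __name__ == "__main__":
        print fetch_with_timeout("http://www.python.org") is not None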


One very likely reason this happens is that python.org may have some mechanism to detect when it is getting multiple requests from a script, and it probably stops serving pages entirely after 4-5 requests. There is really nothing you can do to avoid this other than trying your script on a different site.

You could try using a proxy to avoid being detected for multiple requests as described above. You may want to look at this answer to get an idea of how to send urllib requests through a proxy:
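
The linked answer is not reproduced here, but the general idea with the standard library is to build an opener around urllib2.ProxyHandler. A rough sketch, assuming a purely hypothetical HTTP proxy at proxy.example.com:8080:

    import urllib2

    # Placeholder address; substitute a proxy you actually have access to.
    proxy = urllib2.ProxyHandler({"http": "http://proxy.example.com:8080"})
    opener = urllib2.build_opener(proxy)

    # Either open URLs through the opener directly...
    page = opener.open("http://www.python.org", timeout=10).read()

    # ...or install it globally so that plain urllib2.urlopen() calls
    # (such as the one in fetcher.py) are routed through the proxy too.
    urllib2.install_opener(opener)
    print len(page)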

Yes. It gets stuck after exactly the same number of links every time. I ran the same script on other sites and it didn't give me this problem. :) Thanks for the reply.