Python IOError(e.errno, e.strerror, e.filename) IOError: [Errno 2] No such file or directory
I'm new to Python. I hit this error while running a crawler that simply collects the links on a page. I have Python 2.7 installed and I'm working on OS X. The crawler visits a page, tries to find all the links on that page, and stores them in a list. It then crawls each of those new links and keeps repeating the process until there are no links left to crawl.
  File "crawler.py", line 44, in <module>
    print crawl_web("https://en.wikipedia.org/wiki/Devil_May_Cry_4")
  File "crawler.py", line 7, in crawl_web
    union(tocrawl,get_all_links(get_page(page)))
  File "crawler.py", line 19, in get_page
    response = urllib.urlopen(a)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 469, in open_file
    return self.open_local_file(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 483, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: '/w/load.phpdebug=false&lang=en&modules=ext.cite.styles|ext.gadget.DRN-wizard,ReferenceTooltips,charinsert,featured-articleslinks,refToolbar,switcher,teahouse|ext.wikimediaBadges&only=styles&skin=vector'
Your implementation is flawed. Links in HTML may be relative, for example

/index.php
//en.wikipedia.org/index.php

so you have to detect relative links and prepend the protocol and host.
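A minimal sketch of that fix is to resolve each extracted href against the URL of the page it came from with urljoin, which handles root-relative, protocol-relative, and absolute links uniformly. (The module lives at urlparse in Python 2 and urllib.parse in Python 3; the try/except import below covers both.)

```python
# Sketch: turning relative hrefs into absolute URLs with urljoin.
try:
    from urlparse import urljoin          # Python 2
except ImportError:
    from urllib.parse import urljoin      # Python 3

base = "https://en.wikipedia.org/wiki/Devil_May_Cry_4"

# Root-relative path (like the '/w/load.php...' path in the traceback above)
print(urljoin(base, "/w/index.php"))
# -> https://en.wikipedia.org/w/index.php

# Protocol-relative link keeps the base URL's scheme
print(urljoin(base, "//en.wikipedia.org/index.php"))
# -> https://en.wikipedia.org/index.php

# Absolute links pass through unchanged
print(urljoin(base, "https://example.com/page"))
# -> https://example.com/page
```

Without this step, urllib.urlopen receives a bare path like '/w/load.php...', falls back to treating it as a local file, and raises exactly the IOError shown in the traceback.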
import urllib

def crawl_web(page):
    tocrawl = [page]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

def get_page(a):
    response = urllib.urlopen(a)
    data = response.read()
    return data

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def get_next_target(page):
    start_link = page.find('href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

print crawl_web("https://en.wikipedia.org/wiki/Devil_May_Cry_4")
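One possible patch (a sketch, not the only way to do it) is to thread the page's own URL into get_all_links so every href is absolutized before it reaches the crawl list. The function names below mirror the code above; the sample HTML string is made up for illustration.

```python
# Hedged sketch: get_all_links takes the base URL of the page it is scanning
# and resolves every extracted href against it with urljoin.
try:
    from urlparse import urljoin          # Python 2
except ImportError:
    from urllib.parse import urljoin      # Python 3

def get_next_target(page):
    # Same naive href scan as the original code above
    start_link = page.find('href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote

def get_all_links(page, base_url):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            # Resolve relative hrefs against the page that contained them
            links.append(urljoin(base_url, url))
            page = page[endpos:]
        else:
            break
    return links

# Made-up HTML fragment for demonstration
html = '<a href="/wiki/Devil_May_Cry">DMC</a> <a href="//en.wikipedia.org/w/index.php">edit</a>'
print(get_all_links(html, "https://en.wikipedia.org/wiki/Devil_May_Cry_4"))
# -> ['https://en.wikipedia.org/wiki/Devil_May_Cry', 'https://en.wikipedia.org/w/index.php']
```

With this change, the call in crawl_web would become get_all_links(get_page(page), page), so the URL being crawled serves as the base for resolution.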