Python IOError(e.errno, e.strerror, e.filename) IOError: [Errno 2] No such file or directory
I'm new to Python. I hit this error while running a crawler that simply collects the links on a page. I have Python 2.7 installed and I'm working on OS X. The crawler visits a page, tries to find all the links on that page, and stores them in a list. It then crawls each of those new links and keeps repeating the process until there are no links left to crawl.
  File "crawler.py", line 44, in <module>
    print crawl_web("https://en.wikipedia.org/wiki/Devil_May_Cry_4")
  File "crawler.py", line 7, in crawl_web
    union(tocrawl,get_all_links(get_page(page)))
  File "crawler.py", line 19, in get_page
    response = urllib.urlopen(a)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 213, in open
    return getattr(self, name)(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 469, in open_file
    return self.open_local_file(url)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib.py", line 483, in open_local_file
    raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] No such file or directory: '/w/load.phpdebug=false&lang=en&modules=ext.cite.styles|ext.gadget.DRN-wizard,ReferenceTooltips,charinsert,featured-articleslinks,refToolbar,switcher,teahouse|ext.wikimediaBadges&only=styles&skin=vector'
Your implementation is flawed. Links in HTML may be relative, for example

/index.php
//en.wikipedia.org/index.php

so you have to detect relative links and prepend the protocol and host.
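A minimal sketch of that fix is to resolve each extracted href against the URL of the page it came from with urljoin, which handles root-relative, protocol-relative, and absolute links uniformly. (The module lives at urlparse in Python 2 and urllib.parse in Python 3; the try/except import below covers both.)

```python
# Sketch: turning relative hrefs into absolute URLs with urljoin.
try:
    from urlparse import urljoin          # Python 2
except ImportError:
    from urllib.parse import urljoin      # Python 3

base = "https://en.wikipedia.org/wiki/Devil_May_Cry_4"

# Root-relative path (like the '/w/load.php...' path in the traceback above)
print(urljoin(base, "/w/index.php"))
# -> https://en.wikipedia.org/w/index.php

# Protocol-relative link keeps the base URL's scheme
print(urljoin(base, "//en.wikipedia.org/index.php"))
# -> https://en.wikipedia.org/index.php

# Absolute links pass through unchanged
print(urljoin(base, "https://example.com/page"))
# -> https://example.com/page
```

Without this step, urllib.urlopen receives a bare path like '/w/load.php...', falls back to treating it as a local file, and raises exactly the IOError shown in the traceback.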
import urllib

def crawl_web(page):
    tocrawl = [page]
    crawled = []
    while tocrawl:
        page = tocrawl.pop()
        if page not in crawled:
            union(tocrawl, get_all_links(get_page(page)))
            crawled.append(page)
    return crawled

def union(a, b):
    for e in b:
        if e not in a:
            a.append(e)

def get_page(a):
    response = urllib.urlopen(a)
    data = response.read()
    return data

def get_all_links(page):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links

def get_next_target(page):
    start_link = page.find('href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    return url, end_quote

print crawl_web("https://en.wikipedia.org/wiki/Devil_May_Cry_4")
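One possible patch (a sketch, not the only way to do it) is to thread the page's own URL into get_all_links so every href is absolutized before it reaches the crawl list. The function names below mirror the code above; the sample HTML string is made up for illustration.

```python
# Hedged sketch: get_all_links takes the base URL of the page it is scanning
# and resolves every extracted href against it with urljoin.
try:
    from urlparse import urljoin          # Python 2
except ImportError:
    from urllib.parse import urljoin      # Python 3

def get_next_target(page):
    # Same naive href scan as the original code above
    start_link = page.find('href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    return page[start_quote + 1:end_quote], end_quote

def get_all_links(page, base_url):
    links = []
    while True:
        url, endpos = get_next_target(page)
        if url:
            # Resolve relative hrefs against the page that contained them
            links.append(urljoin(base_url, url))
            page = page[endpos:]
        else:
            break
    return links

# Made-up HTML fragment for demonstration
html = '<a href="/wiki/Devil_May_Cry">DMC</a> <a href="//en.wikipedia.org/w/index.php">edit</a>'
print(get_all_links(html, "https://en.wikipedia.org/wiki/Devil_May_Cry_4"))
# -> ['https://en.wikipedia.org/wiki/Devil_May_Cry', 'https://en.wikipedia.org/w/index.php']
```

With this change, the call in crawl_web would become get_all_links(get_page(page), page), so the URL being crawled serves as the base for resolution.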