
Python error 104: Connection reset by peer

Tags: python, python-2.7, web-scraping, web-crawler, pyspider

I don't understand why I keep getting this error or how to fix it. I have run the crawler against a number of different URLs, and the error doesn't happen every time. Is this something I can fix in my code, or is it out of my control?

I have looked on Stack Overflow, but the answers to similar questions haven't worked for me.

Here is the code I'm running with Vagrant and Python 2.7:

import urllib2
from urlparse import urljoin
from urlparse import urlparse
from bs4 import BeautifulSoup
import re
import socket


def check_web_health(root, max_depth):
    # Crawl outward from the root URL up to max_depth levels deep,
    # recording the status of every page visited.
    domain = get_domain(root)
    filter_domain = [domain]
    tocrawl = [[root, 1]]   # queue of [url, depth] pairs
    crawled = {}            # url -> status
    count = 0
    while tocrawl:
        crawl_ele = tocrawl.pop()
        link = crawl_ele[0]
        depth = crawl_ele[1]

        if link not in crawled.keys():
            content, status = get_page(link)
            if content == None:
                crawled[link]= status
                continue
            host = get_domain(link)
            if depth < max_depth and host in filter_domain:
                outlinks = get_all_links(content,link)
                print '-----------------------------------'
                print 'Adding outlinks ' + str(outlinks) + ' for parent page '+link
                print '-----------------------------------'
                add_to_tocrawl(crawled.keys(),tocrawl, outlinks, depth+1)
            crawled[link]= status

    f = open('site_health.txt', 'w')
    for url,status in crawled.iteritems():
        f.write(url)
        f.write('\t')
        f.write('\t')
        f.write(status)
        f.write('\n')
    f.close()

def get_domain(url):
    # Pull a comparable domain name out of a URL (IP addresses are returned as-is).
    hostname = urlparse(url).hostname
    if len(re.findall(r'[0-9]+(?:\.[0-9]+){3}', hostname)) > 0:
        return hostname
    elif len(hostname.split('.')) == 0:
        return hostname
    elif hostname.find('www.') != -1:
        return hostname.split('.')[0]
    else:
        return hostname.split('.')[1]

def get_page(url):
    # Fetch the page; returns (content, status), with content set to None on any failure.
    print url
    try:
        response = urllib2.urlopen(url)
        return response.read(), 'OK'
    except urllib2.HTTPError,e:
        return None, str(e.code)
    except urllib2.URLError,e:
        print e.args
        return None, 'Invalid Url'
    except:
        return None, 'Wrong Url'

def get_next_target(page, parent):
    # Find the next <a href="..."> in the raw HTML and resolve it against the parent URL.
    start_link = page.find('<a href=')
    if start_link == -1:
        return None, 0
    start_quote = page.find('"', start_link)
    end_quote = page.find('"', start_quote + 1)
    url = page[start_quote + 1:end_quote]
    url = urljoin(parent,url)
    return url, end_quote

def get_all_links(page, parent):
    # Collect every link on the page by repeatedly scanning for the next anchor tag.
    links = []
    while True:
        url, endpos = get_next_target(page,parent)
        if url:
            links.append(url)
            page = page[endpos:]
        else:
            break
    return links


def add_to_tocrawl(crawled, tocrawl, newlinks, depth):
    # Queue only links that have not been crawled and are not already queued.
    for link in newlinks:
        if link not in tocrawl and link not in crawled:
            tocrawl.append([link,depth])


check_web_health('https://www.chicagomaps.org', 3)  # put any URL from the internet here
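To make the question concrete, here is a rough sketch (my own guess, not working code from my project) of how I understand the usual advice for handling [Errno 104] Connection reset by peer: catch socket.error / urllib2.URLError, treat ECONNRESET as a transient failure, and retry the request a few times. The function name get_page_with_retry, the retry count, the delay, and the timeout are all placeholders I made up:

import errno
import socket
import time
import urllib2


def get_page_with_retry(url, retries=3, delay=2):
    # Placeholder sketch: treat "connection reset by peer" (errno 104) as a
    # transient failure and retry a few times before giving up.
    for attempt in range(retries):
        try:
            response = urllib2.urlopen(url, timeout=10)
            return response.read(), 'OK'
        except urllib2.HTTPError, e:
            return None, str(e.code)
        except (urllib2.URLError, socket.error), e:
            reason = getattr(e, 'reason', e)
            if isinstance(reason, socket.error) and reason.errno == errno.ECONNRESET:
                time.sleep(delay)  # back off before retrying the same URL
                continue
            return None, 'Invalid Url'
    return None, 'Connection reset %d times' % retries

The idea would be for something like get_page_with_retry to stand in for get_page inside check_web_health, but I don't know whether retrying is actually the right fix here or whether the server is deliberately dropping my connections.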
Check this similar question, it should help!
Do a search before posting; your question may well be an exact duplicate of this one: Possible duplicate
I have seen those posts before and tried what they suggest, but I still get the error.