Python HTTPConnectionPool Failed to establish a new connection: [Errno 11004] getaddrinfo failed

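I'm wondering whether my requests are being blocked by the website and whether I need to set up a proxy. I first tried closing the HTTP connection, but that failed. I also tried testing my code, but now there seems to be no output. Maybe everything will be fine if I use a proxy? Here is the code: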

import requests
from urllib.parse import urlencode
import json
from bs4 import BeautifulSoup
import re
from html.parser import HTMLParser
from multiprocessing import Pool
from requests.exceptions import RequestException
import time


def get_page_index(offset, keyword):
    #headers = {'User-Agent':'Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50'}
    data = {
        'offset': offset,
        'format': 'json',
        'keyword': keyword,
        'autoload': 'true',
        'count': 20,
        'cur_tab': 1
    }
    url = 'http://www.toutiao.com/search_content/?' + urlencode(data)
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(e)

def parse_page_index(html):
    data = json.loads(html)
    if data and 'data' in data.keys():
        for item in data.get('data'):
            url = item.get('article_url')
            if url and len(url) < 100:
                yield url

def get_page_detail(url):
    try:
        response = requests.get(url, headers={'Connection': 'close'})
        response.encoding = 'utf-8'
        if response.status_code == 200:
            return response.text
        return None
    except RequestException as e:
        print(e)

def parse_page_detail(html):
    soup = BeautifulSoup(html, 'lxml')
    title = soup.select('title')[0].get_text()
    pattern = re.compile(r'articleInfo: (.*?)},', re.S)
    pattern_abstract = re.compile(r'abstract: (.*?)\.', re.S)
    res = re.search(pattern, html)
    res_abstract = re.search(pattern_abstract, html)
    if res and res_abstract:
        data = res.group(1).replace(r".replace(/<br \/>|\n|\r/ig, '')", "") + '}'
        abstract = res_abstract.group(1).replace(r"'", "")
        content = re.search(r'content: (.*?),', data).group(1)
        source = re.search(r'source: (.*?),', data).group(1)
        time_pattern = re.compile(r'time: (.*?)}', re.S)
        date = re.search(time_pattern, data).group(1)
        date_today = time.strftime('%Y-%m-%d')
        img = re.findall(r'src=&quot;(.*?)&quot', content)
        if date[1:11] == date_today and len(content) > 50 and img:
            return {
                'title': title,
                'content': content,
                'source': source,
                'date': date,
                'abstract': abstract,
                'img': img[0]
            }

def main(offset):
    flag = 1
    html = get_page_index(offset, '光伏')
    for url in parse_page_index(html):
        html = get_page_detail(url)
        if html:
            data = parse_page_detail(html)
            if data:
                html_parser = HTMLParser()
                cwl = html_parser.unescape(data.get('content'))
                data['content'] = cwl
                print(data)
                print(data.get('img'))
                flag += 1
                if flag == 5:
                    break



if __name__ == '__main__':
    pool = Pool()
    pool.map(main, [i*20 for i in range(10)])

Here is the error:

HTTPConnectionPool(host='tech.jinghua.cn', port=80): Max retries exceeded with url: /zixun/20160720/f191549.shtml (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x00000000048523C8>: Failed to establish a new connection: [Errno 11004] getaddrinfo failed',))

By the way, when I first tested my code, everything worked fine!

Thanks in advance!

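It seems to me that you are hitting the connection limit of the HTTPConnectionPool, because you start 10 threads at the same time.

Try one of the following:

Increase the request timeout (in seconds):

requests.get('url', timeout=5)

Close the response:

response.close()

Instead of returning response.text directly, assign the response text to a variable, close the response, and then return the variable.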
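
The two suggestions above can be combined in one small helper. This is only a minimal sketch, not the asker's code: `fetch_text` is a hypothetical name, and `session` is assumed to be a `requests.Session` (any object with a compatible `get()` method would work).

```python
def fetch_text(session, url, timeout=5):
    """Fetch url and return its body as text, or None on a non-200 status.

    `session` is assumed to be a requests.Session; `fetch_text` itself is a
    hypothetical helper, not part of the requests library.
    """
    # The timeout bounds the connect phase and each read, not the whole download.
    response = session.get(url, timeout=timeout)
    try:
        if response.status_code != 200:
            return None
        response.encoding = 'utf-8'
        text = response.text  # read the body before closing the response
    finally:
        response.close()  # always release the underlying connection
    return text


# Usage (requires the requests package and network access):
#   import requests
#   with requests.Session() as session:
#       html = fetch_text(session, 'http://www.toutiao.com/')
```

Copying the text into a variable first means the response can be closed in a `finally` block, even when the status check returns early. Connection errors are deliberately not caught here, so the caller decides whether to retry or skip the URL.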

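When I ran into this problem, I had the following symptoms:

- The requests Python module could not fetch anything from any URL, even though I could browse the site in a browser and wget or curl could download the page.
- pip install did not work either and failed with the following error:

Failed to establish a new connection: [Errno 11004] getaddrinfo failed

A certain site had blocked me, so I tried using ForceBindIP to make my Python modules use another network interface, and then I removed it. That probably messed up my network configuration, so the requests module, and even the plain socket module, got stuck and could not fetch any URL.

So I followed the network configuration reset at the URL below, and now everything works.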
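
`[Errno 11004] getaddrinfo failed` is raised while the hostname is being resolved, before any HTTP request is sent, so it can help to test name resolution on its own from the same Python interpreter. A minimal sketch using only the standard library; `can_resolve` is a hypothetical helper name:

```python
import socket


def can_resolve(host, port=80):
    """Return True if the OS resolver can turn host into at least one address."""
    try:
        return len(socket.getaddrinfo(host, port)) > 0
    except socket.gaierror:
        # This is the same failure that requests reports as "getaddrinfo failed".
        return False


# Usage: check the host from the traceback before blaming the website or a proxy.
#   can_resolve('tech.jinghua.cn')
```

If this returns False for hosts that a browser on the same machine can open, the problem is more likely in the local DNS or network configuration than in the scraping code.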

In case it helps anyone else, I ran into the same error message:

Client-Request-ID=long-string Retry policy did not allow for a retry: , HTTP status code=Unknown, Exception=HTTPSConnectionPool(host='table.table.core.windows.net', port=443): Max retries exceeded with url: /service(PartitionKey='requests',RowKey='9999') (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000001D920ADA970>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed')).

...when trying to retrieve a record from Azure Table Storage using

table_service.get_entity(table_name, partition_key, row_key)

My problem: my table_name was incorrectly defined.

I tested my code again. I get output, but it stops when it hits the HTTPConnectionPool error. Is there a way