Python for-loop web scraping: site throws TimeoutError, NewConnectionError and requests.exceptions.ConnectionError
Sorry, I'm a beginner at Python and web scraping. I'm scraping a web page to extract the readings of input characters. I made a list of 10,273 characters, format each one into a URL, and open the page with the readings; I use the Requests module to fetch the source code, then Beautiful Soup to return all the audio IDs (their strings contain the readings of the input character; I can't use the text that appears in the table because it is rendered as SVG). Then I try to output the characters and their readings to out.txt:
# -*- coding: utf-8 -*-
import requests, time
from bs4 import BeautifulSoup
from requests.packages.urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

characters = [
    # characters go here
]

output = open("out.txt", "a", encoding="utf-8")
tic = time.perf_counter()
for char in characters:
    # Characters from the list are formatted into the url
    url = "https://wugniu.com/search?char=%s&table=wenzhou" % char
    page = requests.get(url, verify=False)
    soup = BeautifulSoup(page.text, 'html.parser')
    for audio_tag in soup.find_all('audio'):
        audio_id = audio_tag.get('id').replace("0-", "")
        #output.write(char)
        #output.write(" ")
        #output.write(audio_id)
        #output.write("\n")
    print(char)
    time.sleep(60)
output.close()
toc = time.perf_counter()
duration = int(toc) - int(tic)
print("Took %d seconds" % duration)
out.txt is the output file I'm trying to write the results to. I also timed the whole run to measure performance. However, after around 50 iterations of the loop, I got the following in cmd:
Traceback (most recent call last):
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 169, in _new_conn
conn = connection.create_connection(
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 96, in create_connection
raise err
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\connection.py", line 86, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 699, in urlopen
    httplib_response = self._make_request(
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 382, in _make_request
self._validate_conn(conn)
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 1010, in _validate_conn
conn.connect()
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 353, in connect
conn = self._new_conn()
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connection.py", line 181, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 439, in send
resp = conn.urlopen(
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\connectionpool.py", line 755, in urlopen
retries = retries.increment(
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\urllib3\util\retry.py", line 573, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "C:\Users\[user]\Documents\wenzhou-ime\test.py", line 3282, in <module>
    page = requests.get(url, verify=False)
File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 76, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\api.py", line 61, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 542, in request
    resp = self.send(prep, **send_kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\sessions.py", line 655, in send
    r = adapter.send(request, **kwargs)
  File "C:\Users\[user]\Documents\wenzhou-ime\env\lib\site-packages\requests\adapters.py", line 516, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='wugniu.com', port=443): Max retries exceeded with url: /search?char=%E8%87%B4&table=wenzhou (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x000002035D5F9040>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
Instead of opening and closing the file by hand:

output = open("out.txt", "a", encoding="utf-8")
output.close()

use a context manager, which closes the file automatically even if an exception is raised:

with open('out.txt', 'w', newline='', encoding='utf-8') as output:
    # here you can do your operation.

Likewise, instead of old-style % formatting:

url = "https://wugniu.com/search?char=%s&table=wenzhou" % char

use str.format (or an f-string):

url = "https://wugniu.com/search?char={}&table=wenzhou".format(char)
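As a side note, the character ends up percent-encoded in the URL anyway (the traceback above shows /search?char=%E8%87%B4). The standard library can do that encoding for you, so a hand-formatted URL is not strictly needed; this is just a sketch of the idea (the variable names are illustrative):

```python
from urllib.parse import urlencode

# Build the query string instead of interpolating the raw character;
# urlencode UTF-8 percent-encodes non-ASCII values automatically.
base = "https://wugniu.com/search"
params = {"char": "致", "table": "wenzhou"}
url = f"{base}?{urlencode(params)}"
print(url)  # https://wugniu.com/search?char=%E8%87%B4&table=wenzhou
```

requests can also do this for you if you pass `params=params` to `requests.get` instead of building the URL yourself.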
Here is a rewritten version that reuses a single requests.Session across all requests instead of opening a new connection for every character:

import requests
from bs4 import BeautifulSoup
import urllib3
urllib3.disable_warnings()

def main(url, chars):
    with open('result.txt', 'w', newline='', encoding='utf-8') as f, requests.Session() as req:
        req.verify = False
        for char in chars:
            print(f"Extracting {char}")
            r = req.get(url.format(char))
            soup = BeautifulSoup(r.text, 'lxml')
            target = [x['id'][2:] for x in soup.select('audio[id^="0-"]')]
            print(target)
            f.write(f'{char}\n{str(target)}\n')

if __name__ == "__main__":
    chars = ['核']
    main('https://wugniu.com/search?char={}&table=wenzhou', chars)
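If the site still drops connections after many requests, one further option (not part of the answer above; the helper name and parameters are illustrative) is to mount urllib3's Retry logic on the session, so failed connections are retried with a growing backoff instead of raising immediately:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

def make_session(total_retries=5, backoff_factor=2.0):
    """Build a requests.Session that transparently retries failed requests."""
    retry = Retry(
        total=total_retries,
        backoff_factor=backoff_factor,       # sleeps grow between attempts
        status_forcelist=(429, 500, 502, 503, 504),
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    # Use the retrying adapter for all http/https URLs.
    session.mount("https://", adapter)
    session.mount("http://", adapter)
    return session

session = make_session()
# session.get(...) now retries on connection failures before raising
# requests.exceptions.ConnectionError.
```

With this in place, requests raises ConnectionError only after all retries are exhausted; combined with a longer time.sleep between characters, this should survive the occasional dropped connection.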