Max retries exceeded with url in Python
The scanner keeps running until it hits an external address that is no longer available, and then it crashes. I only want to scan herold.at and extract the email addresses. I want it to stop scanning external addresses. I tried
r = requests.get('http://github.com', allow_redirects=False)
but it doesn't work.
import csv
import requests
import re
import time
from bs4 import BeautifulSoup

# Number of pages plus one
allLinks = [];mails=[];
url = 'https://www.herold.at/gelbe-seiten/wien/was_installateur/?page='
for page in range(3):
    time.sleep(5)
    print('---', page, '---')
    response = requests.get(url + str(page), timeout=1.001)
    soup=BeautifulSoup(response.text,'html.parser')
    links = [a.attrs.get('href') for a in soup.select('a[href]') ]
    for i in links:
        #time.sleep(15)
        # Note: this condition is always true, because the non-empty string
        # "Porträt" is truthy, so every link (external ones too) is appended.
        if(("Kontakt" in i or "Porträt")):
            allLinks.append(i)

allLinks=set(allLinks)

def findMails(soup):
    #time.sleep(15)
    for name in soup.find_all("a", "ellipsis"):
        if(name is not None):
            emailText=name.text
            match=bool(re.match('[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+',emailText))
            if('@' in emailText and match==True):
                emailText=emailText.replace(" ",'').replace('\r','')
                emailText=emailText.replace('\n','').replace('\t','')
                if(len(mails)==0)or(emailText not in mails):
                    print(emailText)
                    mails.append(emailText)

for link in allLinks:
    if(link.startswith("http") or link.startswith("www")):
        # Absolute links, including external sites, are requested directly.
        r=requests.get(link)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)
    else:
        newurl=url+link
        r=requests.get(newurl)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)

mails=set(mails)
if(len(mails)==0):
    print("NO MAILS FOUND")
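Error:
requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.gebrueder-lamberger.at', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

The error is in this line:
if(link.startswith("http") or link.startswith("www")):
Change http to https and it should work. I tried it, and it collected all the email addresses: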
--- 0 ---
--- 1 ---
--- 2 ---
office@smutny-installationen.at
office@offnerwien.at
office@remes-gmbh.at
wien13@lugar.at
office@rossbacher-at.com
office@weiner-gmbh.at
office@wojtek-installateur.at
office@b-gas.at
office@blasl-gmbh.at
gsht@aon.at
office@ertl-installationen.at
office@jakubek.co.at
office@peham-installateur.at
office@installateur-weber.co.at
office@gebrueder-lamberger.at
office@ar-allround-installationen.at
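
Independently of which links end up being requested, a single unreachable host does not have to crash the whole run. The following is only a minimal sketch, assuming it is acceptable to skip pages that cannot be loaded. The helper name fetch_html and its get parameter are hypothetical and not part of the original code; they only wrap requests.get so that a failed request is reported and skipped.

```python
import requests


def fetch_html(url, get=requests.get, timeout=5.0):
    """Return the page body as text, or None if the request fails.

    `get` defaults to requests.get; it is a parameter only so that the
    function can be exercised without real network access.
    """
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # treat 4xx/5xx responses as failures too
    except requests.exceptions.RequestException as exc:
        # ConnectionError and Timeout are both subclasses of RequestException.
        print(f"skipping {url}: {exc}")
        return None
    return response.text
```

In the final loop this could be used as html = fetch_html(link), calling findMails(BeautifulSoup(html, 'html.parser')) only when html is not None.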
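Also, you could try configuring your streaming pool.
It sounds like you need to add some code to filter the unwanted links out of allLinks, right?
Right, I don't need to scan external domains. I only need herold.at.
Yes. You already have code that loops through the addresses. What is stopping you from adding a filter there? Also, your code has nothing that prevents rescanning URLs that were already scanned.
I don't know what to add. Can you help me?
It does not work: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.installateurexperte.at', port=443)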
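
The filter suggested in the comments could look roughly like the sketch below. It is not taken from the thread: the names BASE_URL, ALLOWED_HOST, is_internal and internal_links are hypothetical, and it assumes that "internal" means the host herold.at or one of its subdomains. Relative links are resolved with urljoin instead of string concatenation, and a set keeps track of URLs that were already collected so that none is scanned twice.

```python
from urllib.parse import urljoin, urlparse

# Assumed values for illustration; adjust to the pages actually being scanned.
BASE_URL = "https://www.herold.at/gelbe-seiten/wien/was_installateur/"
ALLOWED_HOST = "herold.at"


def is_internal(url, allowed_host=ALLOWED_HOST):
    """True if the URL's host is allowed_host or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return host == allowed_host or host.endswith("." + allowed_host)


def internal_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, drop external ones and duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        if href.startswith("www."):
            # A bare "www..." link has no scheme; treat it as absolute.
            href = "http://" + href
        absolute = urljoin(base_url, href)
        if is_internal(absolute) and absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result
```

With something like this, the final loop would iterate over internal_links(allLinks) and never request hosts such as www.gebrueder-lamberger.at or www.installateurexperte.at. Links without a host, for example mailto: links, are dropped as well.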