Max retries exceeded with url in Python


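The scanner keeps running until it hits an external address that is no longer available, and then it crashes.

I only want to scan herold.at and extract the email addresses.

I want it to stop scanning external addresses. I tried

r = requests.get('http://github.com', allow_redirects=False)
but it doesn't work.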

import csv
import requests
import re
import time
from bs4 import BeautifulSoup

# Number of pages plus one

allLinks = []
mails = []
url = 'https://www.herold.at/gelbe-seiten/wien/was_installateur/?page='
for page in range(3):
    time.sleep(5)
    print('---', page, '---')

    response = requests.get(url + str(page), timeout=1.001)
soup = BeautifulSoup(response.text, 'html.parser')
links = [a.attrs.get('href') for a in soup.select('a[href]')]
for i in links:
    #time.sleep(15)
    if ("Kontakt" in i or "Porträt"):
        allLinks.append(i)
allLinks = set(allLinks)

def findMails(soup):
    #time.sleep(15)
    for name in soup.find_all("a", "ellipsis"):
        if name is not None:
            emailText = name.text
            match = bool(re.match('[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+', emailText))
            if '@' in emailText and match:
                emailText = emailText.replace(" ", '').replace('\r', '')
                emailText = emailText.replace('\n', '').replace('\t', '')
                if emailText not in mails:
                    print(emailText)
                    mails.append(emailText)

for link in allLinks:
    if link.startswith("http") or link.startswith("www"):
        r = requests.get(link)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)

    else:
        newurl = url + link
        r = requests.get(newurl)
        data = r.text
        soup = BeautifulSoup(data, 'html.parser')
        findMails(soup)

mails = set(mails)
if len(mails) == 0:
    print("NO MAILS FOUND")
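Error:


requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.gebrueder-lamberger.at', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))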

The error is in this line:

if(link.startswith("http") or link.startswith("www")):

Change http to https and it should work. I tried it, and it retrieved all the email addresses:

--- 0 ---
--- 1 ---
--- 2 ---
office@smutny-installationen.at
office@offnerwien.at
office@remes-gmbh.at
wien13@lugar.at
office@rossbacher-at.com
office@weiner-gmbh.at
office@wojtek-installateur.at
office@b-gas.at
office@blasl-gmbh.at
gsht@aon.at
office@ertl-installationen.at
office@jakubek.co.at
office@peham-installateur.at
office@installateur-weber.co.at
office@gebrueder-lamberger.at
office@ar-allround-installationen.at
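Regardless of which scheme is used, a single unreachable host does not have to stop the whole run. The following is only a minimal sketch, not the answerer's method: it wraps the request in a try/except so that a failed page is skipped. The function name fetch_html and its get parameter are hypothetical names introduced here (the parameter exists only so the function can be exercised without network access), and the 10 second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page text, or None if the request fails for any reason.

    `get` is a hypothetical injection point; by default it is requests.get.
    """
    try:
        response = get(url, timeout=timeout)
        response.raise_for_status()  # treat HTTP 4xx/5xx responses as failures too
    except requests.exceptions.RequestException as exc:
        # ConnectionError, Timeout and HTTPError all derive from RequestException.
        print("skipping", url, "->", type(exc).__name__)
        return None
    return response.text
```

In the question's final loop, the direct requests.get(link) calls could be replaced by fetch_html(link), followed by a check for None before the page is handed to BeautifulSoup.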

You can also try configuring your connection pool.
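If the suggestion refers to the connection pool that requests keeps per session, one possible setup is sketched below. This is an interpretation, not something the answer spells out: the function name build_session and the retry count, backoff and pool size values are assumptions chosen for illustration. Retries mainly help with transient failures; a host that is permanently unreachable will most likely still raise ConnectionError once the retries are used up.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def build_session(total_retries=3, backoff=1.0, pool_size=10):
    """Create a Session whose adapters share a retry policy and pool size."""
    retry = Retry(
        total=total_retries,                     # upper bound on retries per request
        backoff_factor=backoff,                  # wait grows between attempts
        status_forcelist=(500, 502, 503, 504),   # also retry on these HTTP statuses
    )
    adapter = HTTPAdapter(
        pool_connections=pool_size,
        pool_maxsize=pool_size,
        max_retries=retry,
    )
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```

A session built this way would be used as session.get(url, timeout=...) in place of requests.get, so connections to the same host are reused across pages.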

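It sounds like you need to add some code to filter the unwanted links out of allLinks, right?

Right, I don't need to scan external domains. I only need herold.at.

Yes. You already have code that loops through the addresses. What is stopping you from adding a filter there? Also, your code has nothing to prevent rescanning URLs that have already been scanned.

I don't know what to add. Can you help me?

not work requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.installateurexperte.at', port=443)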
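The filter discussed in these comments could look roughly like the sketch below. It uses only the standard library. The names is_internal and select_links, the constants, and the example paths are hypothetical, and it assumes that every page worth scanning lives on herold.at or one of its subdomains. It also tests each keyword explicitly: the condition in the question, "Kontakt" in i or "Porträt", is always true because a non-empty string is truthy, so every link on the page ends up in allLinks.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.herold.at/gelbe-seiten/wien/was_installateur/"
ALLOWED_HOST = "herold.at"          # assumption: only this domain should be scanned
KEYWORDS = ("Kontakt", "Porträt")   # keywords taken from the question's condition


def is_internal(url, allowed_host=ALLOWED_HOST):
    """Return True if url points at allowed_host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == allowed_host or host.endswith("." + allowed_host)


def select_links(hrefs, base_url=BASE_URL):
    """Resolve hrefs against base_url, keep internal keyword links, drop repeats."""
    seen = set()
    selected = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # works for relative and absolute hrefs
        if not is_internal(absolute):
            continue                        # external domain, mailto:, tel:, ...
        if not any(keyword in absolute for keyword in KEYWORDS):
            continue
        if absolute in seen:
            continue                        # already queued, do not scan it twice
        seen.add(absolute)
        selected.append(absolute)
    return selected


# Hypothetical hrefs, for illustration only.
example_hrefs = [
    "/gelbe-seiten/wien/abc/Kontakt/",
    "http://www.gebrueder-lamberger.at/",
    "/gelbe-seiten/wien/abc/Kontakt/",
    "mailto:office@b-gas.at",
]
print(select_links(example_hrefs))
```

Because urljoin already produces absolute URLs, the separate http/www branch and the url + link concatenation in the question's final loop would no longer be needed; the loop could request each entry of select_links(links) directly.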