Max retries exceeded with url in Python

python

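The scanner keeps running until it reaches an external address that is no longer available, and then it crashes.

I only want to scan herold.at and extract the email addresses.

I want it to stop scanning external addresses. I tried

r = requests.get('http://github.com', allow_redirects=False)

but it does not work.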

import csv
import requests
import re
import time
from bs4 import BeautifulSoup

# Number of pages plus one

allLinks = [];mails=[];
url = 'https://www.herold.at/gelbe-seiten/wien/was_installateur/?page='
for page in range(3):
    time.sleep(5)
    print('---', page, '---')

    
    response = requests.get(url + str(page), timeout=1.001)
soup=BeautifulSoup(response.text,'html.parser')
links = [a.attrs.get('href') for a in soup.select('a[href]') ]
for i in links:
    #time.sleep(15)
    if(("Kontakt" in i or "Porträt")):
        allLinks.append(i)
allLinks=set(allLinks)

def findMails(soup):
    #time.sleep(15)
    for name in soup.find_all("a", "ellipsis"):
        if(name is not None):
            emailText=name.text
            match=bool(re.match('[a-zA-Z0-9-_.]+@[a-zA-Z0-9-_.]+',emailText))
            if('@' in emailText and match==True):
                emailText=emailText.replace(" ",'').replace('\r','')
                emailText=emailText.replace('\n','').replace('\t','')
                if(len(mails)==0)or(emailText not in mails):
                         print(emailText)
                         mails.append(emailText)

for link in allLinks:
   if(link.startswith("http") or link.startswith("www")):
        r=requests.get(link)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)

   else:
        newurl=url+link
        r=requests.get(newurl)
        data=r.text
        soup=BeautifulSoup(data,'html.parser')
        findMails(soup)

mails=set(mails)
if(len(mails)==0):
    print("NO MAILS FOUND")

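Error:

requests.exceptions.ConnectionError: HTTPConnectionPool(host='www.gebrueder-lamberger.at', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))

The error occurs at the line if(link.startswith("http") or link.startswith("www")):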

Change http to https and it should work. I tried it, and it retrieved all the email addresses:

--- 0 ---
--- 1 ---
--- 2 ---
office@smutny-installationen.at
office@offnerwien.at
office@remes-gmbh.at
wien13@lugar.at
office@rossbacher-at.com
office@weiner-gmbh.at
office@wojtek-installateur.at
office@b-gas.at
office@blasl-gmbh.at
gsht@aon.at
office@ertl-installationen.at
office@jakubek.co.at
office@peham-installateur.at
office@installateur-weber.co.at
office@gebrueder-lamberger.at
office@ar-allround-installationen.at

You can also try configuring your connection pooling.
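As a rough sketch of what a pooling setup could look like, the block below builds a requests.Session with an HTTPAdapter and a bounded urllib3 Retry policy, and wraps each request in a try/except so that one unreachable host is skipped instead of crashing the scan. The names make_session and fetch_html, and the values for retries, pool size, and timeout, are assumptions for illustration and not part of the original code.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_session(retries=2, pool_size=10):
    """Build a Session with a bounded retry policy and connection pool.

    The default values here are illustrative assumptions.
    """
    retry = Retry(total=retries, backoff_factor=0.5)
    adapter = HTTPAdapter(
        max_retries=retry,
        pool_connections=pool_size,
        pool_maxsize=pool_size,
    )
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


def fetch_html(session, url, timeout=5):
    """Return the page text, or None if the request fails for any reason."""
    try:
        response = session.get(url, timeout=timeout)
        response.raise_for_status()
        return response.text
    except requests.exceptions.RequestException as exc:
        # Covers ConnectionError, Timeout, HTTP errors and invalid URLs.
        print("skipping", url, "->", exc)
        return None
```

In the loop over allLinks, a call such as fetch_html(session, link) could replace requests.get(link), with links that return None simply skipped.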

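It sounds like you need to add some code to filter the unwanted links out of allLinks, right?

Right, I don't need to scan external domains. I only need herold.at.

Yes. You already have code that iterates over the addresses. What is stopping you from adding a filter there? Also, your code has no protection against rescanning URLs that have already been scanned.

I don't know what to add. Can you help me?

That does not work: requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.installateurexperte.at', port=443)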
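The filter discussed in these comments could look something like the following minimal sketch. It uses only the standard library: each href is resolved against the page URL, and only http(s) links whose host is herold.at or one of its subdomains are kept. A set records which URLs were already seen. The names ALLOWED_DOMAIN, is_internal, and filter_links are hypothetical, as is the example path in the usage note.

```python
from urllib.parse import urljoin, urlparse

ALLOWED_DOMAIN = "herold.at"  # assumption: only this site should be crawled


def is_internal(url, domain=ALLOWED_DOMAIN):
    """True if the URL's host is the domain itself or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def filter_links(base_url, hrefs, seen=None):
    """Resolve hrefs against base_url; keep unseen internal http(s) URLs in order."""
    seen = set() if seen is None else seen
    kept = []
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skips mailto:, tel:, javascript: and similar
        if not is_internal(absolute) or absolute in seen:
            continue  # external host, or already scheduled for scanning
        seen.add(absolute)
        kept.append(absolute)
    return kept
```

For example, with base_url set to the listing page, filter_links(base_url, ["/example/", "http://www.gebrueder-lamberger.at/"]) keeps only the herold.at address. Note that urljoin resolves relative links against the page URL, which differs from the url + link concatenation in the question.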