Python: Can't make the script try a few times when it fails to grab the title from a webpage


I've written a script to grab the names of different shops from a few identically structured webpages. The script is working fine.

I'm now trying to build logic into the script so that, when it fails to grab a title from one of those pages, it can retry a few times.

As a test, if I define the line with the selector as
name = soup.select_one(".sales info>h").text
then the script loops indefinitely.

Here is what I have tried so far:

import requests
from bs4 import BeautifulSoup

links = (
    'https://www.yellowpages.com/san-francisco-ca/mip/nizarios-pizza-481135933',
    'https://www.yellowpages.com/nationwide/mip/credo-452182701'
)

def get_title(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    try:
        name = soup.select_one(".sales-info > h1").text
    except Exception:
        print("trying again")
        return get_title(s,link) # I wish to change this so the script tries a few times instead of retrying indefinitely

    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s,link))
How can I make the script try a few times when it fails to grab the title from a webpage?


PS: The webpages I used within the script are placeholders.

I think the easiest way is to switch from recursion to a loop:

def get_title(s,link):
    failed = 0
    while failed < 5:
        try:
            r = s.get(link)
            soup = BeautifulSoup(r.text,"lxml")
            name = soup.select_one(".sales-info > h1").text
            return name
        except Exception: # Best to specify which one, by the way
            failed += 1
    print('Failed too many times')
    return None
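As the comment in the snippet above notes, it's better to catch specific exceptions than a bare Exception. A minimal stdlib-only sketch of the same bounded-retry loop; the fetch_title callable and FetchError class here are hypothetical stand-ins for the s.get(...) + soup.select_one(...).text steps, so the pattern can be shown without hitting the network:

```python
class FetchError(Exception):
    """Stand-in for network errors such as requests.RequestException."""

def get_title_with_limit(fetch_title, max_tries=5):
    # fetch_title is a hypothetical zero-argument callable standing in
    # for the s.get(...) + soup.select_one(...).text steps above
    failed = 0
    while failed < max_tries:
        try:
            return fetch_title()
        except (AttributeError, FetchError):
            # AttributeError is what .text raises when select_one() finds nothing
            failed += 1
    print('Failed too many times')
    return None

# Demo: a fake fetcher that fails twice, then succeeds on the third call.
attempts = {'n': 0}
def flaky_fetch():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise FetchError('temporary failure')
    return 'Nizarios Pizza'

print(get_title_with_limit(flaky_fetch))  # prints: Nizarios Pizza
```

Catching only the exceptions you expect means a genuine bug (say, a typo in your own code) still surfaces instead of being silently retried.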

I've added some parameters to specify the number of retries, the sleep between retries, and the default value to return if every retry fails:

import time
import requests
from bs4 import BeautifulSoup


links = (
    'https://www.webscraper.io/test-sites/e-commerce/allinone',
    'https://www.webscraper.io/test-sites/e-commerce/static'
)

def get_title(s, link, retries=3, sleep=1, default=''):
    """
        s       -> session
        link    -> url
        retries -> number of retries before returning the default value
        sleep   -> sleep between tries (in seconds)
        default -> default value to return if every retry fails
    """

    name, current_retry = default, 0
    while current_retry != retries:
        r = s.get(link)
        soup = BeautifulSoup(r.text,"lxml")
        try:
            # "h8" isn't a real tag, so this always fails on these pages;
            # it's only here to demonstrate the retries - use your real selector
            name = soup.select_one("h8").text
            break  # success - stop retrying
        except Exception:
            print("Retry {}/{}".format(current_retry + 1, retries))
            time.sleep(sleep)
            current_retry += 1

    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s, link, 3, 1, 'Failed to grab {}'.format(link)))
This prints:

Retry 1/3
Retry 2/3
Retry 3/3
Failed to grab https://www.webscraper.io/test-sites/e-commerce/allinone
Retry 1/3
Retry 2/3
Retry 3/3
Failed to grab https://www.webscraper.io/test-sites/e-commerce/static
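A common refinement of the fixed-sleep retry shown above is exponential backoff: double the pause after each failed attempt so a struggling server gets progressively more breathing room. A stdlib-only sketch under the same assumption as before, with a hypothetical fetch callable standing in for the request/parse step:

```python
import time

def get_with_backoff(fetch, retries=3, base_sleep=1.0, default=''):
    """Retry `fetch` up to `retries` times, doubling the sleep each time."""
    sleep = base_sleep
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception:
            print("Retry {}/{} (sleeping {:.2f}s)".format(attempt, retries, sleep))
            time.sleep(sleep)
            sleep *= 2  # 1s, 2s, 4s, ... with the default base_sleep
    return default

# Demo: a fetcher that always fails, so every retry is consumed.
calls = []
def failing_fetch():
    calls.append(1)
    raise ValueError("no title on page")

print(get_with_backoff(failing_fetch, retries=3, base_sleep=0.01,
                       default='Failed to grab page'))  # prints: Failed to grab page
```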

You can achieve the same thing in different ways. Here is one more approach you might want to try:

import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://www.yellowpages.com/san-francisco-ca/mip/nizarios-pizza-481135933",
    "https://www.yellowpages.com/nationwide/mip/credo-452182701"
]

def get_title(s,link,counter=0):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    try:
        name = soup.select_one(".sales-info > h1").text
    except Exception:
        if counter<=3:
            counter += 1
            time.sleep(1)
            print("attempt {} failed; trying again".format(counter))
            return get_title(s,link,counter)
        else:
            return None

    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s,link))
You can also try any retry library, e.g. tenacity or backoff. Note that these typically work as decorators: your function only needs the import, with the decorator then applied along these lines:


import requests
from bs4 import BeautifulSoup
from tenacity import retry ###or import backoff

...

@retry ###or @backoff.on_exception(backoff.expo, requests.exceptions.RequestException)
def get_title(s, link, retries=3, sleep=1, default=''):
...
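Since tenacity or backoff may not be installed everywhere, here is a stdlib-only sketch of the same decorator idea written by hand; the name retry_on_exception and its parameters are my own invention, not any library's API, but the shape roughly mirrors what those decorators do:

```python
import functools
import time

def retry_on_exception(tries=3, sleep=0.0, exceptions=(Exception,)):
    """Decorator: call the wrapped function up to `tries` times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: re-raise the last error
                    time.sleep(sleep)
        return wrapper
    return decorator

# Demo: a fake title-grabber that fails twice, then succeeds.
counter = {'n': 0}

@retry_on_exception(tries=3, exceptions=(AttributeError,))
def flaky_title():
    counter['n'] += 1
    if counter['n'] < 3:
        raise AttributeError('selector matched nothing')
    return 'Nizarios Pizza'

print(flaky_title())  # prints: Nizarios Pizza
```

The upside of the decorator form is that the retry policy stays out of the scraping logic itself, so you can reuse it across every function that hits the network.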