Python: Can't make the script try a few times when it fails to grab the title from a webpage


I've written a script to grab the names of different shops from a few identically structured webpages. The script is working fine.

I'm now trying to build logic into the script so that, when it fails to grab a title from one of those pages, it can retry a few times.

As a test, if I define the line with the selector as
name = soup.select_one(".sales info>h").text
then the script loops indefinitely.

Here is what I have tried so far:

import requests
from bs4 import BeautifulSoup

links = (
    'https://www.yellowpages.com/san-francisco-ca/mip/nizarios-pizza-481135933',
    'https://www.yellowpages.com/nationwide/mip/credo-452182701'
)

def get_title(s,link):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    try:
        name = soup.select_one(".sales-info > h1").text
    except Exception:
        print("trying again")
        return get_title(s,link) # I wish to change this so the script tries a few times instead of retrying indefinitely

    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s,link))
How can I make the script try a few times when it fails to grab the title from a webpage?


PS: The webpages I used within the script are placeholders.

I think the easiest way is to switch from recursion to a loop:

def get_title(s,link):
    failed = 0
    while failed < 5:
        try:
            r = s.get(link)
            soup = BeautifulSoup(r.text,"lxml")
            name = soup.select_one(".sales-info > h1").text
            return name
        except Exception: # Best to specify which one, by the way
            failed += 1
    print('Failed too many times')
    return None
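As the comment in the snippet above notes, it's better to catch specific exceptions than a bare Exception. A minimal stdlib-only sketch of the same bounded-retry loop; the fetch_title callable and FetchError class here are hypothetical stand-ins for the s.get(...) + soup.select_one(...).text steps, so the pattern can be shown without hitting the network:

```python
class FetchError(Exception):
    """Stand-in for network errors such as requests.RequestException."""

def get_title_with_limit(fetch_title, max_tries=5):
    # fetch_title is a hypothetical zero-argument callable standing in
    # for the s.get(...) + soup.select_one(...).text steps above
    failed = 0
    while failed < max_tries:
        try:
            return fetch_title()
        except (AttributeError, FetchError):
            # AttributeError is what .text raises when select_one() finds nothing
            failed += 1
    print('Failed too many times')
    return None

# Demo: a fake fetcher that fails twice, then succeeds on the third call.
attempts = {'n': 0}
def flaky_fetch():
    attempts['n'] += 1
    if attempts['n'] < 3:
        raise FetchError('temporary failure')
    return 'Nizarios Pizza'

print(get_title_with_limit(flaky_fetch))  # prints: Nizarios Pizza
```

Catching only the exceptions you expect means a genuine bug (say, a typo in your own code) still surfaces instead of being silently retried.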

I've added some parameters to specify the number of retries, the sleep between retries, and the default value to return if every retry fails:

import time
import requests
from bs4 import BeautifulSoup


links = (
    'https://www.webscraper.io/test-sites/e-commerce/allinone',
    'https://www.webscraper.io/test-sites/e-commerce/static'
)

def get_title(s, link, retries=3, sleep=1, default=''):
    """
        s       -> session
        link    -> url
        retries -> number of retries before returning the default value
        sleep   -> sleep between tries (in seconds)
        default -> default value to return if every retry fails
    """

    name, current_retry = default, 0
    while current_retry != retries:
        r = s.get(link)
        soup = BeautifulSoup(r.text,"lxml")
        try:
            # "h8" isn't a real tag, so this always fails on these pages;
            # it's only here to demonstrate the retries - use your real selector
            name = soup.select_one("h8").text
            break  # success - stop retrying
        except Exception:
            print("Retry {}/{}".format(current_retry + 1, retries))
            time.sleep(sleep)
            current_retry += 1

    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s, link, 3, 1, 'Failed to grab {}'.format(link)))
This prints:

Retry 1/3
Retry 2/3
Retry 3/3
Failed to grab https://www.webscraper.io/test-sites/e-commerce/allinone
Retry 1/3
Retry 2/3
Retry 3/3
Failed to grab https://www.webscraper.io/test-sites/e-commerce/static
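A common refinement of the fixed-sleep retry shown above is exponential backoff: double the pause after each failed attempt so a struggling server gets progressively more breathing room. A stdlib-only sketch under the same assumption as before, with a hypothetical fetch callable standing in for the request/parse step:

```python
import time

def get_with_backoff(fetch, retries=3, base_sleep=1.0, default=''):
    """Retry `fetch` up to `retries` times, doubling the sleep each time."""
    sleep = base_sleep
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception:
            print("Retry {}/{} (sleeping {:.2f}s)".format(attempt, retries, sleep))
            time.sleep(sleep)
            sleep *= 2  # 1s, 2s, 4s, ... with the default base_sleep
    return default

# Demo: a fetcher that always fails, so every retry is consumed.
calls = []
def failing_fetch():
    calls.append(1)
    raise ValueError("no title on page")

print(get_with_backoff(failing_fetch, retries=3, base_sleep=0.01,
                       default='Failed to grab page'))  # prints: Failed to grab page
```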

You can achieve the same thing in different ways. Here is one more approach you might want to try:

import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://www.yellowpages.com/san-francisco-ca/mip/nizarios-pizza-481135933",
    "https://www.yellowpages.com/nationwide/mip/credo-452182701"
]

def get_title(s,link,counter=0):
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    try:
        name = soup.select_one(".sales-info > h1").text
    except Exception:
        if counter<=3:
            counter += 1
            time.sleep(1)
            print("attempt {} failed; trying again".format(counter))
            return get_title(s,link,counter)
        else:
            return None

    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s,link))
You can also try any retry library, e.g. tenacity or backoff. Note that these typically work as decorators: your function only needs the import, with the decorator then applied along these lines:


import requests
from bs4 import BeautifulSoup
from tenacity import retry ###or import backoff

...

@retry ###or @backoff.on_exception(backoff.expo, requests.exceptions.RequestException)
def get_title(s, link, retries=3, sleep=1, default=''):
...
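Since tenacity or backoff may not be installed everywhere, here is a stdlib-only sketch of the same decorator idea written by hand; the name retry_on_exception and its parameters are my own invention, not any library's API, but the shape roughly mirrors what those decorators do:

```python
import functools
import time

def retry_on_exception(tries=3, sleep=0.0, exceptions=(Exception,)):
    """Decorator: call the wrapped function up to `tries` times."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: re-raise the last error
                    time.sleep(sleep)
        return wrapper
    return decorator

# Demo: a fake title-grabber that fails twice, then succeeds.
counter = {'n': 0}

@retry_on_exception(tries=3, exceptions=(AttributeError,))
def flaky_title():
    counter['n'] += 1
    if counter['n'] < 3:
        raise AttributeError('selector matched nothing')
    return 'Nizarios Pizza'

print(flaky_title())  # prints: Nizarios Pizza
```

The upside of the decorator form is that the retry policy stays out of the scraping logic itself, so you can reuse it across every function that hits the network.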