Python: Can't force the script to try a few times when it fails to grab the title from a webpage
I've written a script to grab the names of different shops from some identical webpages. The script works fine. I'm now trying to build some logic into it so that, if it fails to grab the title from those pages, it can try a few times. As a test, if I define the selector line as name = soup.select_one(".sales-info > h").text, the script loops indefinitely.

What I've tried so far:
import requests
from bs4 import BeautifulSoup

links = (
    'https://www.yellowpages.com/san-francisco-ca/mip/nizarios-pizza-481135933',
    'https://www.yellowpages.com/nationwide/mip/credo-452182701'
)

def get_title(s, link):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    try:
        name = soup.select_one(".sales-info > h1").text
    except Exception:
        print("trying again")
        return get_title(s, link)  # I wish to bring about a change here so the script tries a few times instead of indefinitely
    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s, link))
How can I let the script try a few times when it fails to grab the title from a webpage?
PS: The webpages I'm using in the script are placeholders.

I think the easiest way is to switch from recursion to a loop:
def get_title(s, link):
    failed = 0
    while failed < 5:
        try:
            r = s.get(link)
            soup = BeautifulSoup(r.text, "lxml")
            name = soup.select_one(".sales-info > h1").text
            return name
        except Exception:  # Best to specify which one, by the way
            failed += 1
    print('Failed too many times')
    return None
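The loop above can also be factored into a reusable helper that retries any callable, so the scraping code stays separate from the retry bookkeeping. A minimal sketch (retry_call and flaky are illustrative names, not part of the original script):

```python
import time

def retry_call(func, *args, retries=5, sleep=0, **kwargs):
    """Call func(*args, **kwargs), retrying up to `retries` times on failure."""
    for attempt in range(1, retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception as exc:  # narrow this to the errors you actually expect
            print("attempt {}/{} failed: {}".format(attempt, retries, exc))
            if sleep:
                time.sleep(sleep)
    return None  # all attempts exhausted

# Example: a callable that fails twice, then succeeds.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ValueError("not yet")
    return "title"

print(retry_call(flaky, retries=5))  # two failures are reported, then "title" is printed
```

In the scraper this would be called as retry_call(get_title, s, link, retries=5), with get_title stripped of its own retry logic.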
I've added some parameters to specify the number of retries, the sleep between tries, and the default value to return if every attempt fails:
import time
import requests
from bs4 import BeautifulSoup

links = (
    'https://www.webscraper.io/test-sites/e-commerce/allinone',
    'https://www.webscraper.io/test-sites/e-commerce/static'
)

def get_title(s, link, retries=3, sleep=1, default=''):
    """
    s       -> session
    link    -> url
    retries -> number of retries before returning the default value
    sleep   -> sleep between tries (in seconds)
    default -> default value to return if every retry fails
    """
    name, current_retry = default, 0
    while current_retry != retries:
        try:
            r = s.get(link)
            soup = BeautifulSoup(r.text, "lxml")
            # "h8" matches nothing on purpose, so every try fails and the retry logic is exercised
            name = soup.select_one("h8").text
            return name  # success: stop retrying
        except Exception:
            print("Retry {}/{}".format(current_retry + 1, retries))
            time.sleep(sleep)
            current_retry += 1
    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s, link, 3, 1, 'Failed to grab {}'.format(link)))
Prints:
Retry 1/3
Retry 2/3
Retry 3/3
Failed to grab https://www.webscraper.io/test-sites/e-commerce/allinone
Retry 1/3
Retry 2/3
Retry 3/3
Failed to grab https://www.webscraper.io/test-sites/e-commerce/static
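A side note on the bare except Exception used above: when a CSS selector matches nothing, select_one returns None, so the subsequent .text access raises AttributeError, while network problems surface as requests.exceptions.RequestException. Catching the narrow exception makes the retry logic easier to reason about. A minimal stand-in sketch (FakeTag is a hypothetical placeholder for a bs4 tag, not part of the original script):

```python
def extract_title(element):
    """Return element.text, or None when the selector matched nothing."""
    try:
        return element.text
    except AttributeError:  # select_one() returned None: no match
        return None

class FakeTag:  # hypothetical stand-in for a bs4 Tag with a .text attribute
    text = "Nizario's Pizza"

print(extract_title(FakeTag()))  # → Nizario's Pizza
print(extract_title(None))       # → None
```

With this split, only RequestException failures need to be retried; a missing element can return the default immediately instead of hammering the server.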
You can achieve the same goal in different ways. Here is another one you might want to try:
import time
import requests
from bs4 import BeautifulSoup

links = [
    "https://www.yellowpages.com/san-francisco-ca/mip/nizarios-pizza-481135933",
    "https://www.yellowpages.com/nationwide/mip/credo-452182701"
]

def get_title(s, link, counter=0):
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    try:
        # ".sales-info > h" matches nothing on purpose, to trigger the retries
        name = soup.select_one(".sales-info > h").text
    except Exception:
        if counter <= 3:
            time.sleep(1)
            print("done trying {} times".format(counter))
            counter += 1
            return get_title(s, link, counter)
        else:
            return None
    return name

if __name__ == '__main__':
    with requests.Session() as s:
        s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
        for link in links:
            print(get_title(s, link))
You can also use any retry library, such as tenacity or backoff. Note that these work as decorators: your function only needs the import, and then the decorator is applied like this:
import requests
from bs4 import BeautifulSoup
from tenacity import retry  # or: import backoff
...
@retry  # or: @backoff.on_exception(backoff.expo, requests.exceptions.RequestException)
def get_title(s, link, retries=3, sleep=1, default=''):
    ...
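For reference, the decorator shape those libraries provide can be sketched with only the standard library. This is an illustrative stand-in, not tenacity's or backoff's actual implementation:

```python
import functools
import time

def retry(retries=3, sleep=0, default=None):
    """Minimal retry decorator: call the wrapped function up to `retries`
    times, sleeping between attempts, then return `default` if all fail."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:  # narrow this to the errors you expect
                    print("Retry {}/{}".format(attempt, retries))
                    if sleep:
                        time.sleep(sleep)
            return default
        return wrapper
    return decorator

@retry(retries=3, default="no title")
def always_fails():
    raise RuntimeError("boom")

print(always_fails())  # → no title
```

The real libraries add features on top of this shape, such as exponential backoff and per-exception-type policies, which is why they are usually preferable to hand-rolling.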