Python: How to catch requests.get() exceptions
I'm developing a web scraper for yellowpages.com, and overall it seems to work well. However, while paginating through a long query, requests.get(url) randomly returns a 503 or a 404 response. Sometimes I get a worse exception, such as:
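requests.exceptions.ConnectionError:
HTTPConnectionPool(host='www.yellowpages.com', port=80): Max retries
exceeded with url:
/search?search_terms=florists&geo_location_terms=FL&page=22 (Caused by
NewConnectionError(': Failed to establish a new connection:
[WinError 10053] An established connection was aborted by the software
in your host machine',))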
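Using time.sleep() seems to eliminate the 503 errors, but the 404s and the exceptions remain a problem.
I'm trying to figure out how to "catch" the various responses so that I can make a change (wait, change proxy, change user agent), then retry and/or move on. In pseudocode, something like this: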
If error/exception with requests.get:
    wait and/or change proxy and user agent
    retry requests.get
else:
    pass
At this point, I can't even catch the problem using the following:
try:
    r = requests.get(url)
except requests.exceptions.RequestException as e:
    print(e)
    import sys  # only added here, because it's not part of my stable code below
    sys.exit()
Here is the full code I started from:
import requests
from bs4 import BeautifulSoup
import itertools
import csv
# Search criteria
search_terms = ["florists", "pharmacies"]
search_locations = ['CA', 'FL']
# Structure for Data
answer_list = []
csv_columns = ['Name', 'Phone Number', 'Street Address', 'City', 'State', 'Zip Code']
# Turns list of lists into csv file
def write_to_csv(csv_file, csv_columns, answer_list):
    with open(csv_file, 'w') as csvfile:
        writer = csv.writer(csvfile, lineterminator='\n')
        writer.writerow(csv_columns)
        writer.writerows(answer_list)
# Creates url from search criteria and current page
def url(search_term, location, page_number):
    template = 'http://www.yellowpages.com/search?search_terms={search_term}&geo_location_terms={location}&page={page_number}'
    return template.format(search_term=search_term, location=location, page_number=page_number)
# Finds all the contact information for a record
def find_contact_info(record):
    holder_list = []
    name = record.find(attrs={'class': 'business-name'})
    holder_list.append(name.text if name is not None else "")
    phone_number = record.find(attrs={'class': 'phones phone primary'})
    holder_list.append(phone_number.text if phone_number is not None else "")
    street_address = record.find(attrs={'class': 'street-address'})
    holder_list.append(street_address.text if street_address is not None else "")
    city = record.find(attrs={'class': 'locality'})
    holder_list.append(city.text if city is not None else "")
    state = record.find(attrs={'itemprop': 'addressRegion'})
    holder_list.append(state.text if state is not None else "")
    zip_code = record.find(attrs={'itemprop': 'postalCode'})
    holder_list.append(zip_code.text if zip_code is not None else "")
    return holder_list
# Main program
def main():
    for search_term, search_location in itertools.product(search_terms, search_locations):
        i = 0
        while True:
            i += 1
            # Named page_url so the local variable does not shadow the url() function
            page_url = url(search_term, search_location, i)
            r = requests.get(page_url)
            soup = BeautifulSoup(r.text, "html.parser")
            main = soup.find(attrs={'class': 'search-results organic'})
            page_nav = soup.find(attrs={'class': 'pagination'})
            records = main.find_all(attrs={'class': 'info'})
            for record in records:
                answer_list.append(find_contact_info(record))
            if not page_nav.find(attrs={'class': 'next ajax-page'}):
                csv_file = "YP_" + search_term + "_" + search_location + ".csv"
                write_to_csv(csv_file, csv_columns, answer_list)  # output data to csv file
                break

if __name__ == '__main__':
    main()
Thanks in advance for taking the time to read this long post and reply :)

How about something like this?
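try:
    req = requests.get(url)
    if req.status_code == 503:
        pass
    elif ..:  # other status codes you want to handle
        pass
    else:
        do something when request succeeds
# Use the requests class here: the built-in ConnectionError is a different exception
except requests.exceptions.ConnectionError:
    pass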
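To make the idea above concrete, here is a minimal sketch of the wait-and-retry loop from the question, checking both the status code and the exception. The function name get_with_retries, its parameters, and the choice of 404 and 503 as retryable statuses are assumptions for illustration, not something from the original posts. The get argument exists only so the loop can be exercised without touching the network.

```python
import time

import requests


def get_with_retries(url, get=requests.get, max_attempts=3, wait_seconds=5,
                     retry_statuses=(404, 503)):
    """Return a response, retrying on requests errors and on retry_statuses.

    Returns None if every attempt fails. All names and defaults here are
    illustrative assumptions.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            response = get(url, timeout=10)
        except requests.exceptions.RequestException as e:
            # Connection errors, timeouts, etc. all derive from RequestException
            print("attempt %d failed: %s" % (attempt, e))
        else:
            if response.status_code not in retry_statuses:
                return response
            print("attempt %d got status %d" % (attempt, response.status_code))
        if attempt < max_attempts:
            # This is also the place to switch proxy or user agent
            time.sleep(wait_seconds)
    return None
```

In the scraper, `r = get_with_retries(page_url)` would replace the bare requests.get call, and a None result would mean "skip this page or stop".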
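You can try this:
try:
    # do something
    pass
except requests.exceptions.ConnectionError as exception:
    # handle the NewConnectionError exception
    pass
except Exception as exception:
    # handle any other exception
    pass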
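The order of the except clauses matters because the requests exceptions form a hierarchy, so the most specific class has to come first. The small helper below (describe_failure is a made-up name, used only for illustration) shows how a few of them relate. It also shows that Python's built-in ConnectionError is not the same class as requests.exceptions.ConnectionError.

```python
import requests


def describe_failure(exc):
    """Map an exception to a short label, checking the most specific class first."""
    if isinstance(exc, requests.exceptions.Timeout):
        return "timeout"
    if isinstance(exc, requests.exceptions.ConnectionError):
        return "connection"
    if isinstance(exc, requests.exceptions.HTTPError):
        return "http"
    if isinstance(exc, requests.exceptions.RequestException):
        return "other-requests"
    return "unrelated"


# ConnectTimeout inherits from both ConnectionError and Timeout
print(describe_failure(requests.exceptions.ConnectTimeout()))
# The built-in ConnectionError is not part of the requests hierarchy
print(describe_failure(ConnectionError()))
```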
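I've been doing something similar, and this works for me (mostly):
# for dealing with making requests to web pages
import requests
from requests_negotiate_sspi import HttpNegotiateAuth

# Test results, 1 record per URL to test
w = open(r'C:\Temp\URL_Test_Results.txt', 'w')
# For errors only
err = open(r'C:\Temp\URL_Test_Error_Log.txt', 'w')

print('Starting process')

def test_url(url):
    # Test the URL and write the results out to the log files.
    # Had to disable the warnings: with the verify option turned off, a warning is generated
    # because the website certificates are not checked, so results could be "bad". The main site
    # throws errors into the log for each test if we don't turn it off.
    requests.packages.urllib3.disable_warnings()
    headers = {'User-Agent': 'Mozilla/5.0 (X11; OpenBSD i386) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
    print('Testing ' + url)
    # Try the website link, check for errors.
    try:
        response = requests.get(url, auth=HttpNegotiateAuth(), verify=False, headers=headers, timeout=5)
    except requests.exceptions.HTTPError as e:
        print('HTTP Error')
        print(e)
        w.write('HTTP Error, check error log' + '\n')
        # str(e) is required: concatenating a str with an exception raises TypeError
        err.write('HTTP Error' + '\n' + url + '\n' + str(e) + '\n' + '*********' + '\n' + '\n')
    except requests.exceptions.ConnectionError as e:
        # Some external sites come through this, even though the links work through the browser.
        # I suspect there is some blocking in place to prevent scraping...
        # I could probably work around this somehow.
        print('Connection error')
        print(e)
        w.write('Connection error, check error log' + '\n')
        err.write(str('Connection Error') + '\n' + url + '\n' + str(e) + '\n' + '*********' + '\n' + '\n')
    except requests.exceptions.RequestException as e:
        # Any other error types
        print('Other error')
        print(e)
        w.write('Unknown Error' + '\n')
        err.write('Unknown Error' + '\n' + url + '\n' + str(e) + '\n' + '*********' + '\n' + '\n')
    else:
        # Note that a 404 is still "successful" because we got a valid response back, so it comes
        # through here, not through one of the exceptions above.
        response = requests.get(url, auth=HttpNegotiateAuth(), verify=False)
        print(response.status_code)
        w.write(str(response.status_code) + '\n')
        print('Success! Response code:', response.status_code)
    print('===============================')

test_url('https://stackoverflow.com/')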
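As the comment in the else branch says, requests.get() returns a normal response for a 404 or 503 rather than raising. An HTTPError is only raised if you ask for it by calling raise_for_status() on the response. Here is a small sketch of that; status_or_error is a hypothetical helper name, and the Response object is built by hand so the example runs offline.

```python
import requests


def status_or_error(response):
    """Return 'ok' for a non-error response, else the HTTPError message."""
    try:
        # Raises requests.exceptions.HTTPError for 4xx and 5xx status codes
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        return "HTTP error: " + str(e)
    return "ok"


# Offline demonstration with a hand-built response instead of a real request
fake = requests.Response()
fake.status_code = 404
print(status_or_error(fake))
```

With a real request, the same call would go right after `response = requests.get(...)`, inside the try block, so that the HTTPError branch can actually be reached.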
I'm still having some problems with certain websites timing out; you can follow my attempts to resolve them here:
He mentioned in his post that he had already tried this, and that it didn't actually work.
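If hand-written try/except retries keep falling short, another option (not taken from the posts above, just a commonly used alternative) is to let urllib3 do the retrying by mounting an HTTPAdapter with a Retry policy on a Session. This is a rough sketch: make_retrying_session and its default values are assumptions and would need tuning for the site being scraped.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry


def make_retrying_session(total=3, backoff_factor=1, statuses=(503,)):
    """Build a Session whose HTTP and HTTPS adapters retry failed requests.

    The defaults are illustrative assumptions, not tuned values.
    """
    retry = Retry(
        total=total,                    # overall cap on retries per request
        backoff_factor=backoff_factor,  # growing sleep between attempts
        status_forcelist=statuses,      # also retry when these status codes come back
    )
    adapter = HTTPAdapter(max_retries=retry)
    session = requests.Session()
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session


# Usage (performs real network I/O, so it is left commented out):
# session = make_retrying_session(statuses=(404, 503))
# r = session.get(page_url, timeout=10)
```

Once the retries are used up, requests still raises an exception derived from requests.exceptions.RequestException, so a try/except around session.get() is needed either way.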