Python web scraping with BeautifulSoup: looping and skipping missing URL values
So I'm using the code below to scrape statutes from a site:
from bs4 import BeautifulSoup
import requests

f = open('C:\Python27\projects\FL_final.doc', 'w')
base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    for title in tableContents.find_all('div', {'class': 'Title'}):
        f.write(title.text)
    for data in tableContents.find_all('div', {'class': 'Section'}):
        data = data.text.encode("utf-8", "ignore")
        data = "\n\n" + str(data) + "\n"
        f.write(data)
f.close()
The problem is that some chapters are missing. For example, there are pages for chapters 1 and 2, but the pages for chapters 3, 4 and 5 don't exist. So with range(1, 9) the script fails: it can't fetch the contents of chapters 3, 4 and 5 because their URLs (0003/0003, 0004/0004, 0005/0005) don't exist.

How can I skip the missing URLs in the loop and have the program move on to the next available URL in the range?
Here is the url for chapter 1:

You can check whether tableContents was found, for example:
tableContents = soup.find('div', {'class': 'Chapters'})
if tableContents:
    for title in tableContents.find_all('div', {'class': 'Title'}):
        f.write(title.text)
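Dropping that check into the original loop gives the skipping behavior you asked about. A minimal sketch, assuming a missing chapter simply produces a page without a 'Chapters' div (base_url and f as defined above):

for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    if not tableContents:
        continue  # no chapter content on this page, move on to the next chapter
    for title in tableContents.find_all('div', {'class': 'Title'}):
        f.write(title.text)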
You can add a try around the url request, and check that tableContents is not None before applying find_all:
import sys
import requests
from bs4 import BeautifulSoup

f = open('C:\Python27\projects\FL_final.doc', 'w')
base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException as e:  # This is the correct syntax
        print "missing url"
        print e
        sys.exit(1)
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    if tableContents is not None:
        for title in tableContents.find_all('div', {'class': 'Title'}):
            f.write(title.text)
        for data in tableContents.find_all('div', {'class': 'Section'}):
            data = data.text.encode("utf-8", "ignore")
            data = "\n\n" + str(data) + "\n"
            print data
            f.write(data)
f.close()
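One caveat: requests.get raises a RequestException for network-level failures (DNS errors, refused connections, timeouts), but a 404 for a missing chapter comes back as a normal response, so the except block never fires for it; sys.exit(1) would also abort the whole run rather than skip. If the server answers missing chapters with a 404 (an assumption worth verifying), a status-code check lets the loop continue:

for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    r = requests.get(url)
    if r.status_code == 404:
        print "skipping missing chapter:", url  # page not found, try the next one
        continue
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    if tableContents is not None:
        for title in tableContents.find_all('div', {'class': 'Title'}):
            f.write(title.text)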
Thanks for the quick reply, the code works! Could you explain what exactly you did there? What does the RequestException function do?

It's about handling exceptions (see the documentation for more information). But in your case the problem came from find_all being applied to an undefined tableContents object (a missing chapter).
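For reference, RequestException is the base class for the errors requests can raise, so a single except clause covers connection errors, timeouts and the rest. A tiny illustration (the unreachable host is made up for the example):

import requests

try:
    requests.get("http://nonexistent.example.invalid", timeout=5)
except requests.exceptions.RequestException as e:
    # ConnectionError, Timeout, TooManyRedirects, ... are all subclasses
    print "request failed:", e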