Python web scraping with BeautifulSoup: looping and skipping missing URL values
So I'm using the code below to scrape statutes from a site:
from bs4 import BeautifulSoup
import requests

f = open('C:\Python27\projects\FL_final.doc', 'w')
base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    for title in tableContents.find_all('div', {'class': 'Title'}):
        f.write(title.text)
    for data in tableContents.find_all('div', {'class': 'Section'}):
        data = data.text.encode("utf-8", "ignore")
        data = "\n\n" + str(data) + "\n"
        f.write(data)
f.close()
The problem is that some chapters are missing. For example, there are pages for chapters 1 and 2, but the pages for chapters 3, 4 and 5 don't exist. So with range(1, 9) the script fails: it can't fetch the contents of chapters 3, 4 and 5 because their URLs (0003/0003, 0004/0004, 0005/0005) don't exist.

How can I skip the missing URLs in the loop and have the program move on to the next available URL in the range?
Here is the url for chapter 1:

You can check whether tableContents was found, for example:
tableContents = soup.find('div', {'class': 'Chapters'})
if tableContents:
    for title in tableContents.find_all('div', {'class': 'Title'}):
        f.write(title.text)
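Dropping that check into the original loop gives the skipping behavior you asked about. A minimal sketch, assuming a missing chapter simply produces a page without a 'Chapters' div (base_url and f as defined above):

for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    if not tableContents:
        continue  # no chapter content on this page, move on to the next chapter
    for title in tableContents.find_all('div', {'class': 'Title'}):
        f.write(title.text)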
You can add a try around the url request, and check that tableContents is not None before applying find_all:
import sys
import requests
from bs4 import BeautifulSoup

f = open('C:\Python27\projects\FL_final.doc', 'w')
base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException as e:  # This is the correct syntax
        print "missing url"
        print e
        sys.exit(1)
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    if tableContents is not None:
        for title in tableContents.find_all('div', {'class': 'Title'}):
            f.write(title.text)
        for data in tableContents.find_all('div', {'class': 'Section'}):
            data = data.text.encode("utf-8", "ignore")
            data = "\n\n" + str(data) + "\n"
            print data
            f.write(data)
f.close()
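One caveat: requests.get raises a RequestException for network-level failures (DNS errors, refused connections, timeouts), but a 404 for a missing chapter comes back as a normal response, so the except block never fires for it; sys.exit(1) would also abort the whole run rather than skip. If the server answers missing chapters with a 404 (an assumption worth verifying), a status-code check lets the loop continue:

for chapter in range(1, 9):
    url = base_url.format(chapter=chapter)
    r = requests.get(url)
    if r.status_code == 404:
        print "skipping missing chapter:", url  # page not found, try the next one
        continue
    soup = BeautifulSoup(r.content, "html.parser")
    tableContents = soup.find('div', {'class': 'Chapters'})
    if tableContents is not None:
        for title in tableContents.find_all('div', {'class': 'Title'}):
            f.write(title.text)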
Thanks for the quick reply, the code works! Could you explain what exactly you did there? What does the RequestException function do?

It's about handling exceptions (see the documentation for more information). But in your case the problem came from find_all being applied to an undefined tableContents object (a missing chapter).
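For reference, RequestException is the base class for the errors requests can raise, so a single except clause covers connection errors, timeouts and the rest. A tiny illustration (the unreachable host is made up for the example):

import requests

try:
    requests.get("http://nonexistent.example.invalid", timeout=5)
except requests.exceptions.RequestException as e:
    # ConnectionError, Timeout, TooManyRedirects, ... are all subclasses
    print "request failed:", e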