
Python web scraping with BeautifulSoup: looping and skipping specific URL values

Tags: python, html, python-2.7, web-scraping, beautifulsoup

So I'm using the code below to scrape statutes from a site.

from bs4 import BeautifulSoup
import requests


f = open('C:\Python27\projects\FL_final.doc','w')

base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 9):
  url = base_url.format(chapter=chapter)
  r = requests.get(url)
  soup = BeautifulSoup(r.content, "html.parser")
  tableContents = soup.find('div', {'class': 'Chapters'})
  for title in tableContents.find_all('div', {'class': 'Title'}):
    f.write(title.text)

  for data in tableContents.find_all('div', {'class': 'Section'}):
    data = data.text.encode("utf-8", "ignore")
    data = "\n\n" + str(data) + "\n"
    f.write(data)

f.close()
The problem is that some chapters are missing. For example, there are pages for chapters 1 and 2, but then the pages for chapters 3, 4, and 5 don't exist. So when using range(1, 9) it gives me errors, because it can't fetch the contents of chapters 3, 4, and 5: their (0003/0003, 0004/0004, 0005/0005) URLs don't exist.

How can I skip the missing URLs in the loop and have the program move on to the next available URL in the range?


Here is the URL for chapter 1:

You can check whether tableContents was found, for example:


    tableContents = soup.find('div', {'class': 'Chapters'})
    if tableContents:
        for title in tableContents.find_all('div', {'class': 'Title'}):
            f.write(title.text)


You can add a try around the URL request, and check whether tableContents is None before applying find_all:

import sys
import requests
from bs4 import BeautifulSoup

f = open('C:\Python27\projects\FL_final.doc','w')

base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 9):
  url = base_url.format(chapter=chapter)
  try:
    r = requests.get(url)
  except requests.exceptions.RequestException as e:    # This is the correct syntax
    print "missing url"
    print e
    sys.exit(1)
  soup = BeautifulSoup(r.content, "html.parser")
  tableContents = soup.find('div', {'class': 'Chapters'})

  if tableContents is not None:
    for title in tableContents.find_all('div', {'class': 'Title'}):
      f.write(title.text)

    for data in tableContents.find_all('div', {'class': 'Section'}):
      data = data.text.encode("utf-8", "ignore")
      data = "\n\n" + str(data) + "\n"
      print data
      f.write(data)

f.close()
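One caveat: sys.exit(1) stops the whole script at the first network error, while the original goal was to skip a missing chapter and continue to the next available URL. A sketch of a variant (not tested against the site) replaces the exit with continue so the loop keeps going:

import sys
import requests
from bs4 import BeautifulSoup

f = open('C:\Python27\projects\FL_final.doc', 'w')

base_url = "http://www.leg.state.fl.us/statutes/index.cfm?App_mode=Display_Statute&URL=0000-0099/00{chapter:02d}/00{chapter:02d}.html"

for chapter in range(1, 9):
  url = base_url.format(chapter=chapter)
  try:
    r = requests.get(url)
  except requests.exceptions.RequestException as e:
    print "skipping %s: %s" % (url, e)
    continue  # move on to the next chapter instead of exiting
  soup = BeautifulSoup(r.content, "html.parser")
  tableContents = soup.find('div', {'class': 'Chapters'})
  if tableContents is None:
    continue  # page exists but has no chapter table (missing chapter)
  for title in tableContents.find_all('div', {'class': 'Title'}):
    f.write(title.text.encode("utf-8", "ignore"))  # encode to avoid Unicode errors in Python 2
  for data in tableContents.find_all('div', {'class': 'Section'}):
    f.write("\n\n" + data.text.encode("utf-8", "ignore") + "\n")

f.close()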


Thanks for the quick reply, the code works! Could you explain what exactly you did there? What does RequestException do?

It's about handling exceptions (for more information, see the documentation). But in your case, the problem was with find_all being applied to an undefined tableContents object (for the missing chapters).
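To make the distinction in that last comment concrete, here is a minimal sketch (with a hypothetical URL) separating the two failure modes: requests.exceptions.RequestException signals that the request itself failed (DNS error, timeout, refused connection), while tableContents being None means the page loaded fine but lacks the expected div:

import requests
from bs4 import BeautifulSoup

# Network-level failure: the request itself cannot complete.
try:
    r = requests.get("http://nonexistent.invalid/", timeout=5)  # hypothetical URL that will not resolve
except requests.exceptions.RequestException as e:
    print "request failed:", e

# Parse-level "failure": the page loads, but the expected element is absent.
soup = BeautifulSoup("<html><body>no chapters here</body></html>", "html.parser")
tableContents = soup.find('div', {'class': 'Chapters'})
print tableContents is None  # True -> guard before calling find_all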