Python: Checking whether a next page exists with BeautifulSoup


python web-scraping


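I am currently learning to write scrapers with BeautifulSoup. So far the code below works well, apart from a few problems. First, some background: I am scraping player data from the Fold.it project. Because there are multiple pages to scrape, I have been using this block at the end of the loop to find the next page: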

   next_link = soup.find(class_='active', title='Go to next page')
   url_next = "http://www.fold.it" + next_link['href'] ### problem line???
   print url_next

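Unfortunately, I sometimes get an error like this:

As far as I can tell, for some reason the next-page link is not being parsed. I'm not sure whether this is caused by the particular website, by the code I wrote, or by something else entirely. So far I have tried writing code to check whether the lookup returns NoneType, but it still errors out.

The ideal behavior I'm looking for is to scrape through to the last page and, if an error does occur, to retry that page. Any ideas, comments, or pointers to obvious mistakes would be greatly appreciated.

The full code is below: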

import os
import urllib2
import csv
import time
from bs4 import BeautifulSoup

url_next = 'http://www.fold.it/portal/players/s_all'
url_last = ''

today_string = time.strftime('%m_%d_%Y')
location = '/home/' + 'daily_soloist_' + today_string + '.csv'

mode = 'a' if os.path.exists(location) else 'w'
with open(location, mode) as my_csv:
    while True:
        soup = BeautifulSoup(urllib2.urlopen(url_next).read(), "lxml")
        if url_next == url_last:
            print "Scraping Complete"
            break

        for row in soup('tr', {'class':'even'}):
            cells = row('td')

            #current rank
            rank = cells[0].text

            #finds first text node - user name
            name = cells[1].a.find(text=True).strip()

            #separates ranking
            rank1, rank2 = cells[1].find_all("span")

            #total global score
            score = row('td')[2].string

            data = [[int(str(rank[1:])), name.encode('ascii', 'ignore'), int(str(rank1.text)), int(str(rank2.text)), int(str(score))]]

            #writes to csv
            database = csv.writer(my_csv, delimiter=',')
            database.writerows(data)

        next_link = soup.find(class_='active', title='Go to next page')
        url_next = "http://www.fold.it" + next_link['href'] ### problem line???
        print url_next

        last_link = soup.find(class_='active', title = 'Go to last page')
        url_last = "http://www.fold.it" + last_link['href']

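To work around this, you can wrap the loop body in a try:/except: block. You should add more error handling than I have here. If the try fails, the value of url_next is left unchanged, so the same page is requested again. Be careful, though: if the same page keeps raising an error, you will be stuck in an endless loop.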

try:
    if url_next == url_last:
        print "Scraping Complete"
        break

    for row in soup('tr', {'class':'even'}):
        cells = row('td')

        #current rank
        rank = cells[0].text

        #finds first text node - user name
        name = cells[1].a.find(text=True).strip()

        #separates ranking
        rank1, rank2 = cells[1].find_all("span")

        #total global score
        score = row('td')[2].string

        data = [[int(str(rank[1:])), name.encode('ascii', 'ignore'), int(str(rank1.text)), int(str(rank2.text)), int(str(score))]]

        #writes to csv
        database = csv.writer(my_csv, delimiter=',')
        database.writerows(data)  


    next_link = soup.find(class_='active', title='Go to next page')
    url_next = "http://www.fold.it" + next_link['href'] ### problem line???

except:  #if the above bombs out, maintain the same url_next
    print "problem with this page, try again"

print url_next
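
The endless-loop risk mentioned above can be limited by capping the number of retries per page. The following is only a minimal sketch in Python 3 (the thread's own code is Python 2). The names fetch_with_retries, scrape_page, max_retries and delay_seconds are hypothetical and not part of the original answer; scrape_page stands for whatever function scrapes one page and returns the URL of the next one.

```python
import time


def fetch_with_retries(scrape_page, url, max_retries=3, delay_seconds=0.0):
    """Call scrape_page(url) until it succeeds or the retry budget runs out.

    scrape_page is a hypothetical callable: it scrapes one page and returns
    the URL of the next page, or raises if the page could not be parsed.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return scrape_page(url)
        except (TypeError, AttributeError, IOError) as error:
            # A missing next-page link typically surfaces as a TypeError,
            # because soup.find() returned None and None was subscripted.
            last_error = error
            print("problem with %s (attempt %d of %d): %r"
                  % (url, attempt, max_retries, error))
            time.sleep(delay_seconds)
    # Give up instead of requesting the same broken page forever.
    raise RuntimeError(
        "giving up on %s after %d attempts" % (url, max_retries)
    ) from last_error
```

Catching a short list of exception types rather than using a bare except keeps KeyboardInterrupt and genuine programming errors visible; the list shown here is an assumption and may need adjusting for a real scraper.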

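When you hit an error like this, I would dump the whole page. They may be rate-limiting you, and if so they might show a message saying so.

@jeffcarey What do you mean by "dump the page"? Do you mean saving the current page when the error occurs? I'm trying to learn the jargon, so apologies if this is a basic question.

It could mean saving it to a file, but in this case I meant printing it to the screen, if that's convenient for you. The idea is to take a quick look and see whether the site is failing to serve the data you normally expect to receive.

Sorry for not replying sooner! Thanks for the try/except example. I finally found the problem: it wasn't my code but the site I was scraping. Using try/except let me skip past the empty pages.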
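
The two ideas from these comments, checking for a missing link and dumping the page when it is missing, can be combined in one small helper. This is a rough sketch in Python 3, assuming BeautifulSoup 4 with the built-in "html.parser" backend; the function name next_page_url and the dump_chars parameter are made up for illustration and do not come from the thread.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "http://www.fold.it"


def next_page_url(html, base_url=BASE_URL, dump_chars=500):
    """Return the absolute URL of the next page, or None if there is no link.

    html is the page source as a string. When the link is missing, the start
    of the page is printed so a rate-limit notice or empty page is easy to spot.
    """
    soup = BeautifulSoup(html, "html.parser")
    next_link = soup.find(class_="active", title="Go to next page")
    if next_link is None or not next_link.get("href"):
        print("no next-page link found; start of page follows:")
        print(html[:dump_chars])
        return None
    # urljoin handles both relative and absolute href values.
    return urljoin(base_url, next_link["href"])
```

A caller can then treat a None result as "retry or stop" instead of letting next_link['href'] raise on a page that has no link.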