Python: skipping 404 errors with BeautifulSoup


I want to scrape some URLs with BeautifulSoup. The URLs I'm scraping come from a Google Analytics API call, and some of them don't work properly, so I need a way to skip them.

I tried adding this:

except urllib2.HTTPError:
continue

But I get the following syntax error:

    except urllib2.HTTPError:
         ^
SyntaxError: invalid syntax

Here is my full code:

rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'
def print_results(results):
  # Print data nicely for the user.

  if results:
    for row in results.get('rows'):
      rawdata.append(row[0])
  else:
    print 'No results found'

  urllist = [mystring + x for x in rawdata]

  for row in urllist:  
            # query the website and return the html to the variable 'page'
    page = urllib2.urlopen(row)
    except urllib2.HTTPError:
    continue
    soup = BeautifulSoup(page, 'html.parser')

                # Take out the <div> of name and get its value
    name_box = soup.find(attrs={'class': 'nb-shares'})
    if name_box is None:
      continue
    share = name_box.text.strip() # strip() is used to remove starting and trailing

    # save the data in tuple
    sharelist.append((row,share))

  print(sharelist)


Your syntax error occurs because your except statement has no matching try statement:

try:
    # code that might throw HTTPError
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue

Your except statement is not preceded by a try statement. You should use the following pattern:

try:
    page = urllib2.urlopen(row)
except urllib2.HTTPError:
    continue

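Also pay attention to the indentation levels. The code that runs under the try clause must be indented, and so must the code that runs under the except clause.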
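
The try/except/continue pattern can be tried in isolation. The following is a minimal sketch, assuming Python 3 (where urllib2.HTTPError is available as urllib.error.HTTPError). The fake_fetch function, the fetch_all helper, and the example.com URLs are hypothetical stand-ins, not part of the original code, so the example runs without network access:

```python
import io
from urllib.error import HTTPError


def fake_fetch(url):
    """Hypothetical stand-in for urlopen(): fails for URLs containing 'missing'."""
    if "missing" in url:
        raise HTTPError(url, 404, "Not Found", None, io.BytesIO())
    return "<html>%s</html>" % url


def fetch_all(urls, fetch=fake_fetch):
    """Fetch every URL, skipping the ones that raise HTTPError."""
    pages = []
    for url in urls:
        try:
            # Only the call that may fail goes inside the try block.
            page = fetch(url)
        except HTTPError:
            # Indented under except: skip this URL and go to the next one.
            continue
        pages.append((url, page))
    return pages


results = fetch_all([
    "http://example.com/a",
    "http://example.com/missing",
    "http://example.com/b",
])
print([url for url, _ in results])
```

Note that both the body of try and the body of except sit one level deeper than the keywords themselves, which is exactly what the original attempt was missing.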

Two errors:
1. No try statement
2. No indentation
Use the following:
for row in urllist:
    # query the website and return the html to the variable 'page'
    try:
        page = urllib2.urlopen(row)
    except urllib2.HTTPError:
        continue

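As others have already mentioned:
1. The try statement is missing
2. The indentation is wrong

You should use an IDE or an editor so that you don't run into problems like this. Some good IDEs and editors:

IDE - Eclipse with a plugin
Editor -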

Anyway, here is the code after adding the try statement and fixing the indentation:
import urllib2
from bs4 import BeautifulSoup

rawdata = []
urllist = []
sharelist = []
mystring = 'http://www.konbini.com'


def print_results(results):
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'
    urllist = [mystring + x for x in rawdata]
    for row in urllist:
        # query the website and return the html to the variable 'page'
        try:
            page = urllib2.urlopen(row)
        except urllib2.HTTPError:
            # skip URLs that return an HTTP error
            continue

        # the parsing stays inside the loop so it runs once per fetched page
        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() removes leading and trailing whitespace

        # save the data in tuple
        sharelist.append((row, share))

    print(sharelist)

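If you only want to catch 404s, you need to check the returned code and re-raise any other error; otherwise you will catch and silently ignore more than just 404s: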
import urllib2
from bs4 import BeautifulSoup
from urlparse import urljoin


def print_results(results):
    base = 'http://www.konbini.com'
    rawdata = []
    sharelist = []
    # Print data nicely for the user.
    if results:
        for row in results.get('rows'):
            rawdata.append(row[0])
    else:
        print 'No results found'
    # use urljoin to join to the base url
    urllist = [urljoin(base, h) for h in rawdata]
    for url in urllist:
        # query the website and return the html to the variable 'page'
        try: # need to open with try
            page = urllib2.urlopen(url)
        except urllib2.HTTPError as e:
            if e.getcode() == 404: # check the return code
                continue
            raise # if other than 404, raise the error

        soup = BeautifulSoup(page, 'html.parser')
        # Take out the <div> of name and get its value
        name_box = soup.find(attrs={'class': 'nb-shares'})
        if name_box is None:
            continue
        share = name_box.text.strip()  # strip() removes leading and trailing whitespace

        # save the data in tuple
        sharelist.append((url, share))

    print(sharelist)
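
The "skip 404, re-raise everything else" logic can also be pulled out and checked on its own. This is only a sketch, assuming Python 3 (urllib.error.HTTPError, whose status is read from its code attribute). The names open_or_skip_404 and fake_opener are hypothetical and exist only for this illustration:

```python
import io
from urllib.error import HTTPError


def open_or_skip_404(url, opener):
    """Return opener(url), or None when the server answers 404.

    Any other HTTPError is re-raised so it is not silently ignored.
    """
    try:
        return opener(url)
    except HTTPError as e:
        if e.code == 404:
            return None
        raise


def fake_opener(url):
    # Hypothetical opener used only for this demonstration.
    if url.endswith("/gone"):
        raise HTTPError(url, 404, "Not Found", None, io.BytesIO())
    if url.endswith("/broken"):
        raise HTTPError(url, 500, "Internal Server Error", None, io.BytesIO())
    return "page for " + url


print(open_or_skip_404("http://example.com/ok", fake_opener))
print(open_or_skip_404("http://example.com/gone", fake_opener))
```

In real use, urllib.request.urlopen could presumably be passed as the opener; a 500 response would then still surface as an exception instead of being swallowed along with the 404s.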

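What syntax error are you getting?

I updated my question.

I think you're right, because now my script is stuck in an infinite loop.

@SimonBreton, urllib2.HTTPError will catch more than one kind of error. You should only catch the errors you expect; any other error should be raised, or you may miss something important.

If I don't run export LANG=en_US.UTF-8 before running the script, I get this 'ascii' codec error. But if I do run export LANG=en_US.UTF-8 before running the script, nothing happens: no error, but no results either. My script seems to be stuck.

@SimonBreton, then it isn't stuck, it's still running. Add a print inside the loop to see what is happening.

Thanks. I tried reducing the number of URLs and it works fine.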
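
The suggestion to add a print inside the loop could look roughly like this. It is a sketch under assumptions, written for Python 3: scrape_with_progress, its fetch and log parameters, and the message format are all hypothetical, chosen so that the loop can be exercised without network access:

```python
from urllib.error import HTTPError


def scrape_with_progress(urls, fetch, log=print):
    """Fetch each URL in turn, reporting progress so a long run is visible."""
    fetched = []
    total = len(urls)
    for i, url in enumerate(urls, start=1):
        # One line per URL shows that the script is running, not stuck.
        log("[%d/%d] fetching %s" % (i, total, url))
        try:
            page = fetch(url)
        except HTTPError as e:
            if e.code == 404:
                log("  skipped (404)")
                continue
            raise  # anything other than a 404 is still reported
        fetched.append((url, page))
    return fetched
```

With a long URL list, the counter makes it easy to tell a slow run from a hung one. If individual requests may hang, passing something like lambda u: urllib.request.urlopen(u, timeout=10) as fetch is one option; the 10-second value is an arbitrary assumption.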