Python: How do I skip HTTP 500 Internal Server Errors and continue web scraping with BeautifulSoup?


I'm web scraping with Python and BeautifulSoup, and I'm getting the error "HTTP Error 500: Internal Server Error".

Here is my code:

import requests
from bs4 import BeautifulSoup
import pdb
from urllib.request import urlopen
import csv
from urllib.error import HTTPError

for IPRD_ID in range(1,10):
   url = 'https://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2'.format(IPRD_ID)
   page = urlopen(url)
   soup = BeautifulSoup(page, "lxml")
   table = soup.findAll('table', style="width:100%")
   try:
      for tr in table:
          a = (tr.get_text())
   except:
      print('exe')

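As you can see, I'm using range(1, 10). I stepped through the code line by line. For IPRD_ID=3 the page has no data, so the server responds with a 500 Internal Server Error and the script stops with "HTTP Error 500: Internal Server Error". We have already seen that IPRD_ID=3 fails, and if I widen the range to 1 to 100 there will probably be more error pages. So I'd like to know how to skip pages like this and continue.

In your case, urlopen(url) raises a urllib.error.HTTPError exception. You can catch this exception directly, or catch a more general exception class such as Exception (which derives from BaseException). You can also add a delay between HTTP requests (highly recommended in your case), as in my code: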

import time
from urllib.error import HTTPError
from urllib.request import urlopen

from bs4 import BeautifulSoup

for IPRD_ID in range(1, 10):
    url = 'https://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2'.format(IPRD_ID)
    try:
        page = urlopen(url)
    except HTTPError as exc:
        # e.g. HTTP 500 for an ID that has no data: report it and skip this ID
        print('Something went wrong: HTTP error {} for {}'.format(exc.code, url))
        time.sleep(10)  # wait 10 seconds before requesting the next page
        continue
    else:
        print('Got an HTTP response, start parsing.')
        soup = BeautifulSoup(page, "lxml")
        table = soup.find_all('table', style="width:100%")
        try:
            for tr in table:
                a = tr.get_text()
        except Exception as exc:
            print('Something went wrong during parsing:', exc)
        finally:
            time.sleep(5)  # wait 5 seconds after a successful request before the next one

Hope it helps.
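On the point about catching more general exceptions: HTTPError is a subclass of urllib.error.URLError, which in turn is a subclass of OSError, so the order of the checks matters. Below is a minimal sketch of how the two could be told apart; the helper name describe_failure is made up for illustration and is not part of urllib.

```python
from urllib.error import HTTPError, URLError


def describe_failure(exc):
    """Return a short label for an exception raised by urlopen (hypothetical helper)."""
    # Check the most specific class first: HTTPError is a subclass of URLError.
    if isinstance(exc, HTTPError):
        return 'HTTP error {}'.format(exc.code)
    if isinstance(exc, URLError):
        # No HTTP response at all, e.g. DNS failure or refused connection.
        return 'connection problem: {}'.format(exc.reason)
    return 'unexpected error: {!r}'.format(exc)
```

In the loop above, this could be called from a broader `except Exception as exc:` clause if you also want to skip pages that fail for reasons other than an HTTP status code.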

Try catching the error code, and carry on if you hit an error:

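from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

for IPRD_ID in range(1, 10):
    url = 'https://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2'.format(IPRD_ID)
    try:
        page = urlopen(url)
        soup = BeautifulSoup(page, "lxml")
        table = soup.find_all('table', style="width:100%")
        for tr in table:
            a = tr.get_text()
    except HTTPError as err:  # Python 3 syntax; "except HTTPError, err:" is Python 2 only
        # The loop moves on to the next IPRD_ID after the error is reported
        if err.code == 500:
            print("Internal server error 500")
        else:
            print("Some other error. Error code: ", err.code)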

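Where in your code does the internal server error occur? I'd guess it is on the urlopen(url) line? You could wrap everything from that line to the last line of the snippet in a try/except. In the except block, just log or print which URL returned the 500, and then let the loop continue.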
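A rough sketch of that idea, recording which IDs failed so they can be reviewed afterwards. The function name fetch_pages, the URL_TEMPLATE constant, and the opener parameter are assumptions added for illustration (the opener makes it possible to try the logic without hitting the real server); only urlopen and HTTPError come from the code in the question.

```python
from urllib.error import HTTPError
from urllib.request import urlopen

URL_TEMPLATE = 'https://ipr.etsi.org/IPRDetails.aspx?IPRD_ID={}&IPRD_TYPE_ID=2&MODE=2'


def fetch_pages(ids, opener=urlopen):
    """Fetch every ID and return (pages, failed) instead of stopping on an HTTP error.

    Hypothetical helper: pages maps ID -> raw bytes, failed maps ID -> HTTP status code.
    """
    pages = {}
    failed = {}
    for iprd_id in ids:
        url = URL_TEMPLATE.format(iprd_id)
        try:
            with opener(url) as response:
                pages[iprd_id] = response.read()
        except HTTPError as err:
            # Log which URL failed, remember the status code, and move on.
            failed[iprd_id] = err.code
            print('Skipping {} (HTTP {})'.format(url, err.code))
    return pages, failed
```

Calling fetch_pages(range(1, 100)) would then give you the downloaded pages to pass to BeautifulSoup, plus a dictionary of the IDs that returned an error. Adding a time.sleep between requests, as suggested in the other answer, is still a good idea.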