Python: Scraping a website with BeautifulSoup to download all the documents on it raises IOError


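Hello, I would like to download all the files published on the following website () with a script in Python, Julia, or any other language. It used to be an http site, and BeautifulSoup worked well there; it is now an https site, and unfortunately my code no longer works.

All the files I want to download are in `a` tags with the class `download`. The line in my code that fails is: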

fileDownloader.retrieve(document_url, "forecasted-demand-files/"+document_name)

This raises the following error:

raise IOError, ('http error', errcode, errmsg, headers)
IOError: ('http error', 403, 'Forbidden', <httplib.HTTPMessage instance at 0x104f79e60>)

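This is not an https problem; the page you are trying to scrape simply has some file access restrictions. It is good practice to handle exceptions where you expect them. In this case, any of the file links may be broken or inaccessible.

Try handling the exception as follows: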

import requests
import urllib  # Python 2; on Python 3, URLopener moved to urllib.request

from bs4 import BeautifulSoup

page = requests.get("https://www.nationalgrid.com/uk/electricity/market-and-operational-data/data-explorer")
soup = BeautifulSoup(page.content, 'html.parser')

fileDownloader = urllib.URLopener()
mainLocation = "https://www.nationalgrid.com"

# Every file to download is an <a> tag with the class "download".
for document in soup.find_all('a', class_='download'):

    document_name = document["title"]
    document_url = mainLocation+document["href"]
    try:
        fileDownloader.retrieve(document_url, "forecasted-demand-files/"+document_name)
    except IOError as e:
        # Report the failure (e.g. a 403) and carry on with the next link.
        print('failed to download: {} ({})'.format(document_url, e))
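As a rough Python 3 counterpart to the loop above, the sketch below separates HTTP errors (the server answered with a status such as 403) from network errors (no response at all), so the log says why a link failed. The helper name `try_download` and its `opener` parameter are hypothetical, introduced only for illustration; the sketch uses the standard library rather than the answer's `URLopener`.

```python
import urllib.error
import urllib.request


def try_download(url, dest_path, opener=urllib.request.urlopen):
    """Download url to dest_path; return (ok, detail) instead of raising."""
    try:
        with opener(url) as response:
            data = response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 403.
        return False, "HTTP {} {}".format(err.code, err.reason)
    except urllib.error.URLError as err:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return False, "network error: {}".format(err.reason)
    # Only write the file once the whole body has been read successfully.
    with open(dest_path, "wb") as f_out:
        f_out.write(data)
    return True, "saved {} bytes".format(len(data))
```

`HTTPError` is caught first because it is a subclass of `URLError`. Passing a different `opener` makes the function easy to exercise without touching the network.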

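The problem here is that, for the request to be served, you need to pass a user agent as a header. I don't know how to do that with urllib, but since you are already using requests (which is more user-friendly), you can do it with the following code: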

import requests

from bs4 import BeautifulSoup

page = requests.get("https://www.nationalgrid.com/uk/electricity/market-and-operational-data/data-explorer")
soup = BeautifulSoup(page.content, 'html.parser')


mainLocation = "http://www2.nationalgrid.com"
# Browser-like headers, including the User-Agent the server expects.
header = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
    'Accept-Language': 'en-US,en;q=0.8',
    'Connection': 'keep-alive'
}

for a_link in soup.find_all('a', class_='download'):
    document_url = mainLocation + a_link["href"]
    print('Getting file: {}'.format(document_url))
    response = requests.get(document_url, headers=header)
    # Use the last path segment of the href as the local file name.
    file_to_store = a_link.get('href').split('/')[-1]
    # response.content is bytes, so the file is opened in binary mode.
    with open('files/' + file_to_store, 'wb') as f_out:
        f_out.write(response.content)

The file name is retrieved from the link with a small trick: it is the last path segment of the href.
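Splitting on `/` can pick up a query string or percent-encoded characters if a link contains them. A slightly more defensive version of the same idea might look like the sketch below; the helper name `filename_from_href`, the fallback name, and the sample paths are hypothetical and not taken from the site.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_href(href, fallback="download.bin"):
    """Return the last path segment of href, ignoring query and fragment."""
    # urlsplit separates the path from any "?query" or "#fragment" part.
    path = urlsplit(href).path
    # Decode percent-escapes such as %20, then keep only the final segment.
    name = posixpath.basename(unquote(path))
    # A path ending in "/" has no final segment, so fall back to a default.
    return name or fallback


print(filename_from_href("/files/demand%20data.csv?download=1"))
print(filename_from_href("/files/"))
```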

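This doesn't seem to be the problem, since using requests instead of urllib works fine. Probably the right approach is to pass some header values in the request to make it work.

When I tested this, there was no difference between requests and urllib; both got a 200 for the .csv files being scraped. I ran a sanity test (a link checker for Mac), and it did not throw a 403. My best guess is that the scraper was detected and cut off. Adding a conditional time.sleep to space out the requests for the smaller files might solve this.

Thank you, that solved my problem! I just had the wrong mainLocation; it should be: mainLocation=“”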
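The two suggestions in the comments above (send header values, and space the requests out with time.sleep) can be combined in one sketch. This one uses Python 3's `urllib.request` to show that headers can be set without requests. The function names, the shortened User-Agent string, and the one-second default delay are assumptions for illustration; the pause here is applied between every pair of requests rather than only for small files.

```python
import time
import urllib.request

# Placeholder header value; any browser-like User-Agent string could be used.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}


def build_request(url, headers=HEADERS):
    """Wrap url in a Request object that carries the given headers."""
    return urllib.request.Request(url, headers=headers)


def fetch_bytes(url):
    """Fetch url with the headers above and return the response body."""
    with urllib.request.urlopen(build_request(url)) as response:
        return response.read()


def download_politely(urls, fetcher=fetch_bytes, delay=1.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing `delay` seconds between requests."""
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # no pause is needed before the first request
        results[url] = fetcher(url)
    return results
```

Calling `download_politely(list_of_urls, delay=2.0)` would fetch the files one by one with a two-second gap. Whether a delay actually avoids the 403 depends on how the server detects scrapers, which the thread does not establish.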
