Downloading .xls files from a webpage with Python and BeautifulSoup
I want to download all the .xls, .xlsx, or .csv files from this website into a specified folder:

https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009

I have looked into mechanize, Beautiful Soup, urllib2, and so on. mechanize does not work in Python 3, and urllib2 also had problems with Python 3; I looked for workarounds but could not find any. So I am now trying to make it work with Beautiful Soup.
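(For context: in Python 3 the old urllib2 module was split into urllib.request and urllib.error, so urllib2-based examples only need their imports updated. A minimal sketch of the equivalent call; the URL is just a placeholder:

# Python 2: import urllib2; html = urllib2.urlopen(url).read()
# Python 3 equivalent:
from urllib.request import urlopen
html = urlopen('https://example.com').read()  # placeholder URL

This is why the code below imports from urllib.request rather than urllib2.)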
I found some example code and tried to modify it to fit my problem, as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')
However, when run, this code does not extract any files from the target page, and it does not print any failure message (e.g. "failed to download") either.

- How do I select the Excel files on the page with BeautifulSoup?
- How do I download those files to a local path with Python?
There are three problems with the code as posted:

- The url has a trailing /, which when requested returns an invalid page that does not list the files you want to download.
- The CSS selector in soup.select(...) selects div elements with a webpartid attribute, which does not exist anywhere in the linked document.
- The try:...except: block prevents you from seeing the errors raised when the download is attempted. Using an except block without a specific exception class is bad practice and should be avoided.

The following code fixes all three:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
However, if you run this corrected code, you will notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is raised, even though the same file is downloadable in the browser.

At first I thought this was a referrer check (to prevent hotlinking), but if you watch the request in your browser (e.g. Chrome Developer Tools) you will notice that the initial http:// request is blocked there too, and that Chrome then attempts an https:// request for the same file.

In other words, the request must go via HTTPS to work (regardless of what the URLs in the page say). To fix this you will need to rewrite http: to https: before using the request URL. The following code modifies the URLs correctly and downloads the files. I have also added a variable to specify an output folder, which is joined to the filename with os.path.join:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses the current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
I found this to be a nicely working example, using the BeautifulSoup4, requests, and wget modules with Python 2.7:
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
file_types = ['.xls', '.xlsx', '.csv']

for file_type in file_types:
    response = requests.get(url)
    # Parse only the anchor tags to keep things fast
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            if file_type in link['href']:
                # The hrefs on this page are already absolute URLs,
                # so use them directly instead of prefixing them with url
                full_path = link['href']
                wget.download(full_path)
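Given the 403 behaviour described in the earlier answer, the same http:// to https:// rewrite may be needed here as well (my assumption carried over from that answer, not something this answer mentions):

# Assumed fix, carried over from the answer above: fetch via HTTPS
full_path = link['href'].replace('http://', 'https://')
wget.download(full_path)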
This worked best for me... using Python 3:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses the current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')
    try:
        print("Downloading %s to %s..." % (href, filename))
        urlretrieve(href, filename)
        print("Done.")
    except HTTPError as err:
        # Skip links that have gone missing, but surface any other HTTP error
        if err.code == 404:
            continue
        raise
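For completeness, here is the same flow using only the requests library (a minimal sketch of my own, not from any of the answers above; it assumes requests is installed and keeps the https rewrite and extension filter from the accepted answer):

import os
import requests
from bs4 import BeautifulSoup

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # '' means the current folder

resp = requests.get(URL)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'html.parser')
for link in soup.select('a[href^="http://"]'):
    href = link['href'].replace('http://', 'https://')  # the site requires HTTPS
    if not href.endswith(('.csv', '.xls', '.xlsx')):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    r = requests.get(href)
    r.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(r.content)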
Comments on the question:

- Can you describe in what way your code "doesn't work"? The code as posted is mis-indented, so it cannot run at all.
- The code does run, it just never creates any files. As for the indentation, my apologies, I must have broken it while posting; rest assured it was indented correctly when I ran it.
- I have a working solution for this, but the question has since been closed so I can no longer post it as an answer; I have published it as a gist here.
- @mfitzp thanks, this works. People like you make sure the language never dies!