Downloading .xls files from a webpage with Python and BeautifulSoup
I want to download all the .xls, .xlsx, or .csv files from this website into a specified folder:

https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009

I have looked into mechanize, Beautiful Soup, urllib2, and so on. mechanize does not work in Python 3, and urllib2 also had problems with Python 3; I looked for workarounds but could not find any. So I am now trying to make it work with Beautiful Soup.
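(For context: in Python 3 the old urllib2 module was split into urllib.request and urllib.error, so urllib2-based examples only need their imports updated. A minimal sketch of the equivalent call; the URL is just a placeholder:

# Python 2: import urllib2; html = urllib2.urlopen(url).read()
# Python 3 equivalent:
from urllib.request import urlopen
html = urlopen('https://example.com').read()  # placeholder URL

This is why the code below imports from urllib.request rather than urllib2.)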
I found some example code and tried to modify it to fit my problem, as follows:
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009/'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html)
for link in soup.select('div[webpartid] a'):
    href = link.get('href')
    if href.startswith('javascript:'):
        continue
    filename = href.rsplit('/', 1)[-1]
    href = urljoin(url, quote(href))
    try:
        urlretrieve(href, filename)
    except:
        print('failed to download')
However, when run, this code does not extract any files from the target page, and it does not print any failure message (e.g. "failed to download") either.

- How do I select the Excel files on the page with BeautifulSoup?
- How do I download those files to a local path with Python?
There are three problems with the code as posted:

- The url has a trailing /, which when requested returns an invalid page that does not list the files you want to download.
- The CSS selector in soup.select(...) selects div elements with a webpartid attribute, which does not exist anywhere in the linked document.
- The try:...except: block prevents you from seeing the errors raised when the download is attempted. Using an except block without a specific exception class is bad practice and should be avoided.

The following code fixes all three:

from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve, quote
from urllib.parse import urljoin

# Remove the trailing / you had, as that gives a 404 page
url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
u = urlopen(url)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")

# Select all A elements with href attributes containing URLs starting with http://
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    # Make sure it has one of the correct extensions
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = href.rsplit('/', 1)[-1]
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
However, if you run this corrected code, you will notice that a urllib.error.HTTPError: HTTP Error 403: Forbidden exception is raised, even though the same file is downloadable in the browser.

At first I thought this was a referrer check (to prevent hotlinking), but if you watch the request in your browser (e.g. Chrome Developer Tools) you will notice that the initial http:// request is blocked there too, and that Chrome then attempts an https:// request for the same file.

In other words, the request must go via HTTPS to work (regardless of what the URLs in the page say). To fix this you will need to rewrite http: to https: before using the request URL. The following code modifies the URLs correctly and downloads the files. I have also added a variable to specify an output folder, which is joined to the filename with os.path.join:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses the current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')
    print("Downloading %s to %s..." % (href, filename))
    urlretrieve(href, filename)
    print("Done.")
I found this to be a nicely working example, using the BeautifulSoup4, requests, and wget modules with Python 2.7:
import requests
import wget
import os
from bs4 import BeautifulSoup, SoupStrainer

url = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
file_types = ['.xls', '.xlsx', '.csv']

for file_type in file_types:
    response = requests.get(url)
    # Parse only the anchor tags to keep things fast
    for link in BeautifulSoup(response.content, 'html.parser', parse_only=SoupStrainer('a')):
        if link.has_attr('href'):
            if file_type in link['href']:
                # The hrefs on this page are already absolute URLs,
                # so use them directly instead of prefixing them with url
                full_path = link['href']
                wget.download(full_path)
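Given the 403 behaviour described in the earlier answer, the same http:// to https:// rewrite may be needed here as well (my assumption carried over from that answer, not something this answer mentions):

# Assumed fix, carried over from the answer above: fetch via HTTPS
full_path = link['href'].replace('http://', 'https://')
wget.download(full_path)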
This worked best for me... using Python 3:
import os
from bs4 import BeautifulSoup
# Python 3.x
from urllib.request import urlopen, urlretrieve
from urllib.error import HTTPError

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # path to output folder, '.' or '' uses the current folder

u = urlopen(URL)
try:
    html = u.read().decode('utf-8')
finally:
    u.close()

soup = BeautifulSoup(html, "html.parser")
for link in soup.select('a[href^="http://"]'):
    href = link.get('href')
    if not any(href.endswith(x) for x in ['.csv', '.xls', '.xlsx']):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    # We need a https:// URL for this site
    href = href.replace('http://', 'https://')
    try:
        print("Downloading %s to %s..." % (href, filename))
        urlretrieve(href, filename)
        print("Done.")
    except HTTPError as err:
        # Skip links that have gone missing, but surface any other HTTP error
        if err.code == 404:
            continue
        raise
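For completeness, here is the same flow using only the requests library (a minimal sketch of my own, not from any of the answers above; it assumes requests is installed and keeps the https rewrite and extension filter from the accepted answer):

import os
import requests
from bs4 import BeautifulSoup

URL = 'https://www.rbi.org.in/Scripts/bs_viewcontent.aspx?Id=2009'
OUTPUT_DIR = ''  # '' means the current folder

resp = requests.get(URL)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, 'html.parser')
for link in soup.select('a[href^="http://"]'):
    href = link['href'].replace('http://', 'https://')  # the site requires HTTPS
    if not href.endswith(('.csv', '.xls', '.xlsx')):
        continue
    filename = os.path.join(OUTPUT_DIR, href.rsplit('/', 1)[-1])
    r = requests.get(href)
    r.raise_for_status()
    with open(filename, 'wb') as f:
        f.write(r.content)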
Comments on the question:

- Can you describe in what way your code "doesn't work"? The code as posted is mis-indented, so it cannot run at all.
- The code does run, it just never creates any files. As for the indentation, my apologies, I must have broken it while posting; rest assured it was indented correctly when I ran it.
- I have a working solution for this, but the question has since been closed so I can no longer post it as an answer; I have published it as a gist here.
- @mfitzp thanks, this works. People like you make sure the language never dies!