Python 3.x: Scraping an aspx page with Python

python-3.x web-scraping


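I have never done any web scraping before, but I now think it is the only thing that can help me, so I looked for sample code online. The accepted answer to this StackOverflow question seemed to be exactly what I was looking for:

It did not work and gave me a "403 Forbidden" error. As @Andrej Kesely pointed out, I have to specify a User-Agent.

I then updated the question following his answer: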

import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# an example of a working url
#url = "http://www.gatsby.ucl.ac.uk/teaching/courses/ml1-2016.html"
# my url (still not working)
url = 'http://www.covidmaroc.ma/Pages/LESINFOAR.aspx'

# You can use http://httpbin.org/get to see the User-Agent your browser sends. Mine is:
# Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36
headers = {'User-Agent': 'Mozilla/5.0'}

# If the folder does not exist, the script creates it automatically
folder_location = 'webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for a in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location,a['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url,a['href'])).content)

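Now it runs without errors and creates the PDF files. But when I try to open any of them, they will not open in any PDF reader I have; even Chrome shows "Error: Failed to load PDF document". Also, each scraped PDF is only 179 bytes, whereas the same PDF downloaded manually is 1.XX MB.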
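
A 179-byte file is far too small to be a real bulletin, so it is probably an error response saved under a .pdf name rather than an actual PDF. A minimal sketch for checking this, relying on the fact that PDF files begin with the bytes %PDF-; the helper name looks_like_pdf and the sample payloads are hypothetical and only serve as illustration:

```python
def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF magic bytes."""
    return data.startswith(b"%PDF-")


# Hypothetical payloads for illustration: a PDF header and an HTML error page.
pdf_payload = b"%PDF-1.7\n..."
error_payload = b"<html><body>403 Forbidden</body></html>"

print(looks_like_pdf(pdf_payload))    # True
print(looks_like_pdf(error_payload))  # False
```

Running the same check on the first bytes of a saved file, for example open(filename, 'rb').read(5), shows whether the server sent a document or an error page.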

Try specifying a User-Agent in the headers= argument of the request:

import requests
from bs4 import BeautifulSoup


url = 'http://www.covidmaroc.ma/Pages/LESINFOAR.aspx'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:86.0) Gecko/20100101 Firefox/86.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')
for a in soup.select("a[href$='.pdf']"):
    print(a['href'])

Prints:

...

/Documents/BULLETIN/BQ_SARS-CoV-2.5.9.20.pdf
/Documents/BULLETIN/BQ_SARS-CoV-2.4.9.20.pdf
/Documents/BULLETIN/BQ_SARS-CoV-2.4.9.20.pdf
/Documents/BULLETIN/BULLETIN%20COVID-19Quotidien_03092020.pdf
/Documents/BULLETIN/BULLETIN%20COVID-19Quotidien_03092020.pdf
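
The printed list contains repeated links, relative paths, and percent-encoded names (%20). A possible way to handle all three before downloading is sketched below using only the standard library. The function name unique_pdf_targets and the PAGE_URL constant are hypothetical names introduced for this example, and the href list is a shortened copy of the output above:

```python
from urllib.parse import unquote, urljoin

PAGE_URL = "http://www.covidmaroc.ma/Pages/LESINFOAR.aspx"


def unique_pdf_targets(page_url, hrefs):
    """Map each distinct href to (absolute URL, local filename), keeping order."""
    targets = []
    # dict.fromkeys drops repeated hrefs while keeping first-seen order
    for href in dict.fromkeys(hrefs):
        absolute = urljoin(page_url, href)       # resolve the relative path
        filename = unquote(href.split("/")[-1])  # turn %20 back into a space
        targets.append((absolute, filename))
    return targets


hrefs = [
    "/Documents/BULLETIN/BQ_SARS-CoV-2.4.9.20.pdf",
    "/Documents/BULLETIN/BQ_SARS-CoV-2.4.9.20.pdf",
    "/Documents/BULLETIN/BULLETIN%20COVID-19Quotidien_03092020.pdf",
]
for absolute, filename in unique_pdf_targets(PAGE_URL, hrefs):
    print(absolute, "->", filename)
```

Dropping the duplicates avoids downloading the same bulletin twice, and decoding the filename is optional: the percent-encoded form must still be used in the URL itself.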

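EDIT: Also pass headers= to the final requests.get() call:

...
with open(filename, 'wb') as f:
    f.write(requests.get(urljoin(url,a['href']), headers=headers).content)

Hello, thanks for your feedback. I tried your solution; it runs without errors and creates the PDF files. But when I try to open any of them, they will not open in any PDF reader I have; even Chrome shows "Error: Failed to load PDF document". Also, the scraped PDFs are only 179 bytes, whereas the manually downloaded ones are 1.XX MB.

@AmineChadi You probably also need to supply headers= in f.write(requests.get(urljoin(url, a['href']), headers=headers).content).

Yes, thank you very much, that was very helpful. Please update your answer so that I can mark it as the correct answer.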
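
Since the broken downloads came from one requests.get() call that lacked the headers, one way to avoid repeating headers= on every call is to set the User-Agent once on a session and to fail loudly on HTTP errors. A minimal sketch under those assumptions; the function name save_pdf and its parameters are hypothetical and not part of the original answer:

```python
import os


def save_pdf(session, pdf_url, folder, filename):
    """Download one file through `session` and write it to folder/filename.

    `session` is expected to behave like requests.Session: headers set on it
    are sent with every request made through it.
    """
    response = session.get(pdf_url, timeout=30)
    response.raise_for_status()  # turn 403 and other HTTP errors into exceptions
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    with open(path, "wb") as f:
        f.write(response.content)
    return path


# Intended use (requires the requests package and network access):
#   import requests
#   session = requests.Session()
#   session.headers.update({"User-Agent": "Mozilla/5.0"})
#   save_pdf(session, pdf_url, "webscraping", "bulletin.pdf")
```

Because raise_for_status() runs before the file is opened, a rejected request raises an exception instead of leaving a tiny, unreadable .pdf on disk.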