Python: unable to download PDF files from a website

python web-scraping

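Hi, I have the following code, and I want to download the PDFs from the website "https://www.journal-officiel.gouv.fr/balo/recherche/resultats?parutionDateStart=2021-05-17&parutionDateEnd=2021-05-17&_token=0oP3_cJ2xZ10SbEEGoNdP6vUpAIv5nBkrTZptI0Nzd8"

This is the script I use to download the files, but not a single PDF gets downloaded.

It does not raise any error, yet every run only creates an empty folder.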

from selenium import webdriver
from selenium.webdriver.support.ui import Select
driver = webdriver.Chrome(executable_path='C:\\Users\\u6080267\\Documents\\chromedriver.exe')
driver.get("https://www.journal-officiel.gouv.fr/balo/recherche/")
link = driver.find_element_by_xpath("//a[contains(@href,'token')]")
link.click()

url1 = driver.current_url
import urllib.request
from bs4 import BeautifulSoup
import os
import urllib
from datetime import datetime
import requests
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36'}
def get_urls(url):
    
    url = url1    
    #url = https://www.journal-officiel.gouv.fr/balo/recherche/resultats?parutionDateStart=2021-05-03&parutionDateEnd=2021-05-03&_token=48BMi0HUW0CZJVdbccoO_wX9IzRJfglO8Uq-K0lfMNg
    req = urllib.request.Request(url, None, HEADERS)
    opener = urllib.request.build_opener()
    content = opener.open(req).read()  
    soup = BeautifulSoup(content, "html.parser")
    soup.prettify()      
    urls = {}
    for anchor in soup.findAll('a', href=True): #Going inside links        
        if "/balo/document" in anchor.get('href'):            
            name = anchor.get('href')[(anchor.get('href').rindex("=")+1):]
            url = "https://www.journal-officiel.gouv.fr/" + anchor.get('href')

            if name not in urls:
                urls[name]=url

    return urls
def download(urls, path):
    os.chdir(path)
    for name, url in urls.items():
        try:
            res = requests.get(url, allow_redirects=True)            
            # programmatic access requires a form to be submitted. On agreeing the consent, the pdfurl can be used
            soup = BeautifulSoup(res.content, "html.parser")
            soup.prettify()
            for pdfurl in soup.findAll(attrs={"name": "pdfURL"}):
                downloadurl = "https://www.journal-officiel.gouv.fr/" + pdfurl.get('value')
                res = requests.get(downloadurl)
                open(name + ".pdf", 'wb').write(res.content)
                print ("Downloaded", name + ".pdf")
            
        except Exception as e:
            print ("Failed to download", name, ", because of", e)   

def main():
    pathToStoreFiles = os.getcwd() + "\\" + datetime.today().strftime('%Y-%m-%d')
    os.makedirs(pathToStoreFiles)
    urls = get_urls('')
    download(urls, pathToStoreFiles)
if __name__ == "__main__":
    main()

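Clearly, you are not getting the right pdf download links. You are also making the scraping much more complicated than it has to be: there is no need for the heavy guns of selenium. bs4 can easily access the token and the query URL. You can then use that URL to get the results HTML and parse it for the pdf links.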
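As a side note on building the links: the code in the question concatenates the site root, which ends in a slash, with each href. If the href itself starts with a slash, as the document paths in the output further down suggest, the result contains a double slash. Whether the server tolerates that is not something this sketch can confirm, and it is not necessarily the cause of the empty folder. It only shows, under that assumption, how `urllib.parse.urljoin` from the standard library avoids the problem. `BASE_URL` and `href` are illustrative names.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.journal-officiel.gouv.fr"

# A relative href of the kind the results page is assumed to contain.
href = "/balo/document/202105172101761-59"

# Concatenating a base that ends in "/" with an href that starts with "/"
# produces a double slash after the host name.
concatenated = BASE_URL + "/" + href

# urljoin resolves the href against the base and yields a clean URL.
joined = urljoin(BASE_URL, href)

print(concatenated)
print(joined)
```

Using `urljoin` also keeps working if an href ever turns out to be an absolute URL, in which case it is returned unchanged.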
Here's how to download the first batch of 50 files:
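import time

import requests
from bs4 import BeautifulSoup

main_url = "https://www.journal-officiel.gouv.fr"
search_path = "/balo/recherche/"


def wait_a_bit(wait_for: float = 1.5):
    # Pause between downloads to go easy on the server.
    time.sleep(wait_for)


with requests.Session() as connection:
    connection.headers["user-agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.106 Safari/537.36"

    # The last "aide-link" anchor on the search page carries the query URL with the token.
    search_url = (
        BeautifulSoup(connection.get(f"{main_url}{search_path}").text, "lxml")
        .find_all("a", class_="aide-link")[-1]["href"]
    )

    # Request up to 50 results and build an absolute URL for every download link.
    pdf_links = [
        f'{main_url}{link.find("a")["href"]}' for link in
        BeautifulSoup(
            connection.get(f"{main_url}{search_url}&limit=50").text, "lxml"
        ).select(".download-link")
    ]

    for pdf_link in pdf_links:
        print(f"Fetching {pdf_link}")
        pdf_file = connection.get(pdf_link).content
        # Name each file after the last segment of its URL.
        with open(f'{pdf_link.rsplit("/")[-1]}.pdf', "wb") as output:
            output.write(pdf_file)
        wait_a_bit()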

Output:
Fetching https://www.journal-officiel.gouv.fr/balo/document/202105172101761-59
Fetching https://www.journal-officiel.gouv.fr/balo/document/202105172101784-59
Fetching https://www.journal-officiel.gouv.fr/balo/document/202105172101798-59
Fetching https://www.journal-officiel.gouv.fr/balo/document/202105172101801-59
Fetching https://www.journal-officiel.gouv.fr/balo/document/202105172101810-59

and more ...

All files are saved in the script's current directory, named like this:
202105172101675-59.pdf
202105172101686-59.pdf
202105172101687-59.pdf
202105172101688-59.pdf
202105172101697-59.pdf
...

Clean, concise, efficient, and server-friendly.
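If you want a guard against the symptom described in the question (a run that finishes without errors but leaves nothing usable on disk), one option is to check that each response body really looks like a PDF before writing it. The sketch below is only an illustration. `filename_from_url` and `save_if_pdf` are hypothetical helper names, not part of requests or bs4, and the `%PDF-` prefix check is a simple heuristic based on the header that a well-formed PDF starts with.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse

PDF_MAGIC = b"%PDF-"  # header a well-formed PDF file begins with


def filename_from_url(url: str) -> str:
    """Use the last path segment of the URL as the file name."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return f"{last_segment}.pdf"


def save_if_pdf(url: str, content: bytes, directory: Path) -> Optional[Path]:
    """Write content into directory only if it looks like a PDF.

    Returns the written path, or None when the payload is something else
    (for example an HTML page returned instead of the document).
    """
    if not content.startswith(PDF_MAGIC):
        return None
    target = directory / filename_from_url(url)
    target.write_bytes(content)
    return target
```

Inside the download loop this could replace the `open(...)` block, for instance `save_if_pdf(pdf_link, pdf_file, Path.cwd())`, printing a warning whenever it returns `None`, so that a non-PDF response is reported instead of being saved silently.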