Selenium Python: downloading a pop-up PDF with a specific file name

python selenium pdf

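I need to download a set of individual PDF files from a web page. They are published openly by the government (the Turkish Ministry of Education), so this is entirely legal.

However, my Selenium browser only displays the PDF. How can I download it and give it a name of my choosing?

(This code is also from the web.)

Thanks in advance.

Additional information:

I have Python 2 code that does this perfectly. But somehow it creates empty files, and I could not convert it to Python 3. Maybe this helps.

(No offense, but I never liked Selenium.)

A non-Selenium solution could look like this: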

import requests
pdf_resp = requests.get("https://odsgm.meb.gov.tr/kurslar/PDFFile.aspx?name=kazanimtestleri.pdf")
with open("save.pdf", "wb") as f:
    f.write(pdf_resp.content)

You may want to check the content type first, though, to make sure the response really is a PDF.
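The content-type check mentioned above could be sketched as follows. This is only an illustration, not part of the original answer: the helper names `looks_like_pdf` and `download_pdf` are made up for this example, and the fallback on the leading `%PDF-` bytes is an added assumption for servers that label PDFs with a different content type.

```python
from pathlib import Path


def looks_like_pdf(content_type: str, body: bytes) -> bool:
    """Return True if the header or the leading bytes indicate a PDF."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";")[0].strip().lower()
    # PDF files normally begin with the marker "%PDF-".
    return mime == "application/pdf" or body.startswith(b"%PDF-")


def download_pdf(url: str, dest: str) -> Path:
    """Download url to dest, refusing anything that does not look like a PDF."""
    import requests  # third-party; imported lazily so looks_like_pdf works without it

    resp = requests.get(url, timeout=30)
    resp.raise_for_status()  # fail on 4xx/5xx instead of saving an error page
    content_type = resp.headers.get("Content-Type", "")
    if not looks_like_pdf(content_type, resp.content):
        raise ValueError(f"expected a PDF, got content type {content_type!r}")
    path = Path(dest)
    path.write_bytes(resp.content)
    return path
```

Calling `download_pdf(url, "save.pdf")` with the URL from the answer would then either write the file or raise, rather than silently saving an HTML error page under a `.pdf` name.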

Add your options before launching Chrome, then pass them through the chrome_options parameter:

from selenium import webdriver

download_dir = "/Users/ugur/Downloads/"
options = webdriver.ChromeOptions()

# Disable the built-in PDF viewer so that PDFs are downloaded instead of displayed
profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
           "download.default_directory": download_dir,
           "download.extensions_to_open": "application/pdf"}
options.add_experimental_option("prefs", profile)

# Selenium 3 style call; Selenium 4 expects service=Service(path) and options=options instead
driver = webdriver.Chrome(
    executable_path="/Users/ugur/Downloads/chromedriver",
    chrome_options=options
)
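For readers on Selenium 4, where the `executable_path` and `chrome_options` keyword arguments are no longer accepted, the same setup might look roughly like the sketch below. The function names `pdf_download_prefs` and `make_chrome` are hypothetical, and the preference set is borrowed from the later answer in this thread rather than from this one, so treat it as an assumption to verify against your Chrome version.

```python
import os


def pdf_download_prefs(download_dir):
    """Chrome preferences that save PDFs to download_dir instead of displaying them."""
    return {
        # Use an absolute path for the download directory.
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
        "plugins.always_open_pdf_externally": True,
    }


def make_chrome(download_dir, driver_path=None):
    """Start Chrome with the PDF download preferences (Selenium 4 style)."""
    # Imported lazily so pdf_download_prefs can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", pdf_download_prefs(download_dir))
    # Without an explicit path, Selenium falls back to its own driver lookup.
    service = Service(executable_path=driver_path) if driver_path else Service()
    return webdriver.Chrome(service=service, options=options)
```

Keeping the preferences in a separate function makes it easy to inspect or adjust them without launching a browser.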

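To answer your second question:

May I ask how to specify the file name?

Here is what I found, and what I did:

import os
import time

# Poll the download directory until the most recently created file is a PDF
file_name = ''
while not file_name.lower().endswith('.pdf'):
    time.sleep(.25)
    try:
        file_name = max([os.path.join(download_dir, f) for f in os.listdir(download_dir)], key=os.path.getctime)
    except ValueError:
        # the directory is still empty
        pass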
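The loop above only finds the downloaded file; it does not rename it yet. A possible way to finish the job is sketched below, using only the standard library. The names `snapshot` and `wait_and_rename` are invented for this example, and the sketch assumes that Chrome keeps an unfinished download under a temporary `.crdownload` name, so a new `.pdf` entry is treated as complete.

```python
import os
import time
from pathlib import Path


def snapshot(download_dir):
    """Names currently in the download directory (call this before clicking)."""
    return {p.name for p in Path(download_dir).iterdir()}


def wait_and_rename(download_dir, before, new_name, timeout=30.0, poll=0.25):
    """Wait for a PDF that was not in `before`, then rename it to new_name."""
    directory = Path(download_dir)
    deadline = time.monotonic() + timeout
    while True:
        new_pdfs = [p for p in directory.iterdir()
                    if p.name not in before and p.suffix.lower() == ".pdf"]
        if new_pdfs:
            newest = max(new_pdfs, key=os.path.getctime)
            target = directory / new_name
            os.replace(newest, target)  # overwrites an existing target
            return target
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new PDF appeared in {directory} within {timeout} s")
        time.sleep(poll)
```

Typical use would be `before = snapshot(download_dir)`, then the click that triggers the download, then `wait_and_rename(download_dir, before, "My file.pdf")`. Comparing against a snapshot avoids picking up a PDF left over from an earlier download.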

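Here is a code sample I use to download a PDF with a specific file name. First, configure the Chrome webdriver with the required options. Then, after clicking the button that opens the PDF pop-up, call a function that waits for the download to finish and renames the downloaded file.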

import os
import time
import shutil

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

# function to wait for download to finish and then rename the latest downloaded file
def wait_for_download_and_rename(newFilename):
    # function to wait for all chrome downloads to finish
    def chrome_downloads(drv):
        if "chrome://downloads" not in drv.current_url: # if 'chrome downloads' is not current tab
            drv.execute_script("window.open('');") # open a new tab
            drv.switch_to.window(drv.window_handles[-1]) # switch to the newly opened (last) tab
            drv.get("chrome://downloads/") # navigate to chrome downloads
        return drv.execute_script("""
            return document.querySelector('downloads-manager')
            .shadowRoot.querySelector('#downloadsList')
            .items.filter(e => e.state === 'COMPLETE')
            .map(e => e.filePath || e.file_path || e.fileUrl || e.file_url);
            """)
    # wait for all the downloads to be completed
    dld_file_paths = WebDriverWait(driver, 120, 1).until(chrome_downloads) # returns list of downloaded file paths
    # Close the current tab (chrome downloads)
    if "chrome://downloads" in driver.current_url:
        driver.close()
    # Switch back to original tab
    driver.switch_to.window(driver.window_handles[0]) 
    # get latest downloaded file name and path
    dlFilename = dld_file_paths[0] # latest downloaded file from the list
    # wait till downloaded file appears in download directory
    time_to_wait = 20 # adjust timeout as per your needs
    time_counter = 0
    while not os.path.isfile(dlFilename):
        time.sleep(1)
        time_counter += 1
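        if time_counter > time_to_wait:
            raise TimeoutError('download did not appear: ' + dlFilename) # shutil.move below would fail on a missing file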
    # rename the downloaded file
    shutil.move(dlFilename, os.path.join(download_dir,newFilename))
    return

# specify custom download directory
download_dir = r'c:\Downloads\pdf_reports'

# for configuring chrome pdf viewer for downloading pdf popup reports
chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('prefs', {
    "download.default_directory": download_dir, # Set own Download path
    "download.prompt_for_download": False, # Do not ask for download at runtime
    "download.directory_upgrade": True, # Also needed to suppress download prompt
    "plugins.plugins_disabled": ["Chrome PDF Viewer"], # Disable the built-in PDF viewer plugin
    "plugins.always_open_pdf_externally": True, # Download PDFs instead of opening them in the browser
    })

# get webdriver with options for configuring chrome pdf viewer
driver = webdriver.Chrome(options = chrome_options)

# open desired webpage
driver.get('https://mywebsite.com/mywebpage')

# click the button to open pdf popup
driver.find_element('id', 'someid').click() # find_element_by_id was removed in Selenium 4; 'id' is the value of By.ID

# call the function to wait for download to finish and rename the downloaded file
wait_for_download_and_rename('My file.pdf')

# close the browser windows
driver.quit()

Adjust the timeout (120) to the wait time you need.

Not all sites send PDFs as application/pdf. You have to use the browser inspector to check which content type the server actually sends in your case.

This works. Thanks. May I ask how to specify the file name?

You are so kind. This works very well. Thank you very much.

Thanks. But this file changes with every click. (I don't know why; there is a doPostBack function in the page source.) So we would have to start from there, select each PDF file and download them :/

Yeah, it looks like it might be session-related rather than a well-behaved static URL. You could use a POST request or a requests Session (you could try tracing the network traffic in your browser and reverse-engineering it).

That sounds like it is over my head :) But thank you very much for your help. I will keep this in mind for future use.
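For anyone who wants to try the POST / Session route suggested in the last comments, here is a rough sketch. It assumes the page is a typical ASP.NET WebForms page, where a postback submits the form's hidden fields together with `__EVENTTARGET` and `__EVENTARGUMENT`. The helper names and the `event_target` argument are hypothetical; the real control name has to be read from the page source or from the network trace in the browser.

```python
from html.parser import HTMLParser


class _HiddenInputs(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and (a.get("type") or "").lower() == "hidden" and a.get("name"):
            self.fields[a["name"]] = a.get("value") or ""


def hidden_fields(html):
    """Return the hidden form fields of a page as a dict."""
    parser = _HiddenInputs()
    parser.feed(html)
    return parser.fields


def postback_pdf(page_url, event_target, dest):
    """Replay a postback inside one session and save the response body."""
    import requests  # third-party; imported lazily so hidden_fields works without it

    with requests.Session() as session:  # keeps cookies between both requests
        page = session.get(page_url, timeout=30)
        page.raise_for_status()
        form = hidden_fields(page.text)  # e.g. __VIEWSTATE, __EVENTVALIDATION
        form["__EVENTTARGET"] = event_target  # hypothetical: taken from the page source
        form["__EVENTARGUMENT"] = ""
        resp = session.post(page_url, data=form, timeout=30)
        resp.raise_for_status()
        with open(dest, "wb") as f:
            f.write(resp.content)
```

Whether the server answers the POST with the PDF itself or with a redirect depends on the site, so the response should be inspected before relying on this.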