How to wait in a waiting page, then download a PDF, using Python?

python http selenium pdf download

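I'm trying to download PDF files from a website that runs on flimsy hosting. To cope with traffic, the site implements a waiting page. The waiting page renders, you spend a few seconds looking at it instead of the PDF you wanted, then it disappears and you are sent on to where you wanted to go.

Here's my scenario:

I go to the page.

Perhaps 33% of the time, I get the waiting page. Here's the waiting page code:

The problem is that I can't work out how to do this with Selenium. So my question is:

How do I load the page, wait out the waiting page, and then download the PDF that is subsequently served, using Python (2.7)?

Or, if Selenium can do this, how do I do it?

Example

Workaround: currently I am using: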

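import time
import requests

# Re-request until the response no longer contains the waiting page marker.
r = requests.get(req_str)
while "waiting-main" in r.text:
    time.sleep(5)
    r = requests.get(req_str)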

No word yet on how well this works.
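The loop above never gives up, so it would spin forever if the site kept serving the waiting page. A minimal sketch of a bounded variant follows, written for Python 3. The function name fetch_when_ready, its parameters, and the choice to pass the fetch function in as an argument are illustrative assumptions; only the "waiting-main" marker comes from the workaround above.

```python
import time


def fetch_when_ready(fetch, url, marker="waiting-main", attempts=10, delay=5.0):
    """Call fetch(url) until the returned text no longer contains marker.

    fetch is any callable that returns the page text for a URL (hypothetical
    parameter; with requests it could be lambda u: requests.get(u).text).
    Raises RuntimeError if the marker is still present after `attempts` tries.
    """
    for attempt in range(attempts):
        text = fetch(url)
        if marker not in text:
            return text
        # Skip the sleep after the final attempt; nothing is left to wait for.
        if attempt < attempts - 1:
            time.sleep(delay)
    raise RuntimeError("still on the waiting page after %d attempts" % attempts)
```

With requests installed, this could be called as fetch_when_ready(lambda u: requests.get(u).text, req_str). Passing the fetch function in keeps the retry logic separate from the HTTP library.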

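I would ignore the waiting page. Find a specific element that exists on the download page but not on the waiting page, and wait for it. Just make sure you wait long enough that the waiting page will definitely be gone (maybe 30 seconds or more? You may need to experiment to see how it behaves).

From the HTML you provided, it looks like you can wait for the EMBED element. I suggest using WebDriverWait with the CSS selector "embed[name='plugin']".

More information about Selenium waits for Python is available in the Selenium documentation.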
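To make the idea concrete, here is a minimal sketch that checks page source for an embed element named "plugin" and polls until it shows up. It uses only the Python 3 standard library so it can run without a browser; the names has_pdf_embed, wait_for_pdf_embed, and get_source are illustrative assumptions, not part of Selenium. Selenium's own WebDriverWait combined with expected_conditions does the same polling natively.

```python
import time
from html.parser import HTMLParser


class _EmbedFinder(HTMLParser):
    """Sets .found when an <embed name="plugin"> tag is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "embed" and dict(attrs).get("name") == "plugin":
            self.found = True


def has_pdf_embed(html):
    """Return True if the HTML contains an embed element named "plugin"."""
    finder = _EmbedFinder()
    finder.feed(html)
    return finder.found


def wait_for_pdf_embed(get_source, timeout=30.0, poll=0.5):
    """Poll get_source() until has_pdf_embed is true or timeout elapses.

    get_source is a hypothetical callable returning the current page HTML,
    for example lambda: driver.page_source with a Selenium driver.
    Returns True if the element appeared, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if has_pdf_embed(get_source()):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

The waiting page is never inspected directly: the loop only asks whether the download page's distinguishing element is present yet.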

I can get the page source consistently using requests. This gets the PDF link and saves the file:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

# URL the browser posts to when you click the PDF link
post_url = "http://a810-bisweb.nyc.gov/bisweb/CofoJobDocumentServlet"
base = "http://a810-bisweb.nyc.gov/bisweb/"
r = requests.get("http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=4&allbin=1006360")

soup = BeautifulSoup(r.content, "html.parser")
# collect the form's input name/value pairs
form_data = {inp["name"]: inp["value"] for inp in soup.select("form[action=CofoJobDocumentServlet] input")}
# post the form data
nr = requests.post(post_url, data=form_data)
soup = BeautifulSoup(nr.content, "html.parser")

# build the absolute URL of the PDF from the iframe's src
pdf = urljoin(base, soup.select_one("iframe")["src"])

# save the PDF to a file
with open("out.pdf", "wb") as out:
    out.write(requests.get(pdf).content)
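Because any request to this site may return the waiting page, the bytes written to out.pdf are not guaranteed to be a PDF. A minimal sketch of a guard follows, written for Python 3 and relying on the fact that PDF files begin with the bytes "%PDF-". The helper names looks_like_pdf and save_pdf are illustrative assumptions.

```python
def looks_like_pdf(data):
    """Return True if the byte string starts with the PDF file header."""
    return data.startswith(b"%PDF-")


def save_pdf(data, path):
    """Write data to path only if it looks like a PDF; return the path.

    Raises ValueError otherwise, so the caller can wait and retry instead
    of silently saving an HTML waiting page under a .pdf name.
    """
    if not looks_like_pdf(data):
        raise ValueError("response is not a PDF (possibly the waiting page)")
    with open(path, "wb") as out:
        out.write(data)
    return path
```

In the script above, this would be called as save_pdf(requests.get(pdf).content, "out.pdf"), with the ValueError used as the signal to sleep and request again.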

If you run into the waiting problem, you can wait for the form to be present using Selenium, then pass the page source to bs4:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


def wait(dr, x, t):
    # block for up to t seconds until elements matching XPath x are present
    element = WebDriverWait(dr, t).until(
        EC.presence_of_all_elements_located((By.XPATH, x))
    )
    return element

dr = webdriver.PhantomJS()
dr.get("http://a810-bisweb.nyc.gov/bisweb/COsByLocationServlet?requestid=4&allbin=1006360")

# the form only exists on the real page, not on the waiting page
wait(dr, "//form[@action='CofoJobDocumentServlet']", 30)

post_url = "http://a810-bisweb.nyc.gov/bisweb/CofoJobDocumentServlet"
base = "http://a810-bisweb.nyc.gov/bisweb/"

soup = BeautifulSoup(dr.page_source, "html.parser")

form_data = {inp["name"]: inp["value"] for inp in soup.select("form[action=CofoJobDocumentServlet] input")}

nr = requests.post(post_url, data=form_data)
soup = BeautifulSoup(nr.content, "html.parser")

pdf = urljoin(base, soup.select_one("iframe")["src"])

with open("out.pdf", "wb") as out:
    out.write(requests.get(pdf).content)

You need to set a download path for PDFs and add the "always open PDF externally" option:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

driver_path = "path_from_chromedriver"
download_path = "./PdfFolder"
optionsSelenium = Options()
optionsSelenium.add_experimental_option('prefs', {
    "download.default_directory": download_path,
    "download.prompt_for_download": False,
    "download.directory_upgrade": True,
    # download PDFs instead of opening them in Chrome's built-in viewer
    "plugins.always_open_pdf_externally": True
})
# Selenium 3 style call; Selenium 4 takes options= and a Service object instead
driver = webdriver.Chrome(executable_path=driver_path, chrome_options=optionsSelenium)

With this setting, any page that would display a PDF simply downloads the file and closes the new tab.
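Since Chrome downloads in the background, the script needs some way to know when the file has arrived. A minimal sketch follows, written for Python 3 and relying on Chrome giving in-progress downloads a ".crdownload" suffix. The function name wait_for_download and its parameters are illustrative assumptions.

```python
import os
import time


def wait_for_download(folder, timeout=60.0, poll=0.5):
    """Return the paths of finished .pdf files in folder, or raise on timeout.

    The download counts as finished once at least one .pdf file exists and
    no .crdownload (in-progress) file remains in the folder.
    """
    deadline = time.monotonic() + timeout
    while True:
        names = os.listdir(folder)
        pdfs = sorted(n for n in names if n.lower().endswith(".pdf"))
        partial = [n for n in names if n.endswith(".crdownload")]
        if pdfs and not partial:
            return [os.path.join(folder, n) for n in pdfs]
        if time.monotonic() >= deadline:
            raise TimeoutError("no finished PDF in %r after %s seconds" % (folder, timeout))
        time.sleep(poll)
```

It would be called with the same directory passed as download.default_directory, after navigating the driver to the PDF URL. Starting from an empty folder keeps PDFs from earlier runs from being mistaken for the new download.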

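You can use WebDriverWait(driver, 10).until_not(something_to_disappear) to wait for the loader window to close. As for the second part, I'm not sure I understood "the subsequently served PDF" correctly... what do you mean?

Unfortunately my description is hampered by my incomplete understanding of the problem I'm trying to solve. I've updated it to try to make my question clearer; feedback welcome!

If you look closely, you'll notice that the site actually has a URL you can hit directly. If you can work out how the URL is constructed, you can simplify the whole process; CofoodDocumentContentServlet occasionally serves the waiting notice.

Hm. I'm not sure this solves my problem. What does requests.get(pdf) have to do with waiting? Isn't that a separate request, which could trigger the waiting page again?

@ResMar, which request causes the wait?

Any request for a page in their interface can produce a waiting page. That is, both loading the page that lists the PDFs and loading the PDF itself can produce one. I ended up using a dumb five-second wait in case I hit the waiting page (source code).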