Extracting PDFs from PHP links using Requests

php python html pdf


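This is my first attempt at web scraping, and I'm a bit lost on the last 10% of the problem.

Is there some trick to getting requests to download from a PHP link that I'm missing?

I'm trying to extract some PDFs from a conference proceedings page (a PHP page). Using Selenium, I can successfully log in and navigate to all of the individual pages that link to the PDF files (bear in mind this is my first time, so I like to see the pages being navigated), and I can get the titles of all the papers. I then try the following to get the contents of the links: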

import re

# Get the links
pdfLinks = browser.find_elements_by_tag_name('a')
for pdfItem in pdfLinks:
    href = pdfItem.get_attribute('href') or ''  # anchors without an href return None
    if re.search('loadPDF', href):
        print(pdfItem.text)

The code above correctly identifies all of the link text, so far so good.

import requests

# Trying to prove out a simple save
for pdfItem in pdfLinks:
    href = pdfItem.get_attribute('href') or ''  # anchors without an href return None
    if re.search('session-5.5', href):
        print('Requesting "{0:s}" from {1:s}'.format(pdfItem.text, href))
        res = requests.get(href, stream=True, headers={'User-Agent': 'firefox'})
        res.raise_for_status()
        # Write the response body to disk in 100 kB chunks
        with open('{0:s}.pdf'.format(pdfItem.text), 'wb') as pdfFile:
            for chunk in res.iter_content(100000):
                pdfFile.write(chunk)

The response headers I get from this are:

{'content-length': '112', 'date': 'Tue, 07 Jul 2015 13:46:27 GMT', 'connection': 'Keep-Alive', 'content-encoding': 'gzip', 'content-type': 'text/html', 'pragma': 'no-cache', 'set-cookie': 'PHPSESSID=7d62686unhha7q0muh4u41pe27; path=/', 'expires': 'Thu, 19 Nov 1981 08:52:00 GMT', 'vary': 'Accept-Encoding', 'keep-alive': 'timeout=5, max=100', 'cache-control': 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0', 'accept-ranges': 'none', 'server': 'Apache'}

I have tried the same code with the

res.raise_for_status()

and

stream=True, headers={'User-Agent': 'firefox'}

bits removed, but without success.

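If I actually click the link and inspect the GET request in the browser console, I see the request headers and the cookie information that is sent, followed by these response headers: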

Response Headers Δ81ms
Server: Apache
Pragma: no-cache
Keep-Alive: timeout=5, max=100
Expires:    Thu, 19 Nov 1981 08:52:00 GMT
Date:   Tue, 07 Jul 2015 14:23:25 GMT
Content-Type:   application/pdf
Content-Transfer-Encoding:  binary
Content-Length: 1160565
Content-Disposition:    inline; filename="Custom file name for the.pdf"
Connection: Keep-Alive
Cache-Control:  no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Accept-Ranges:  bytes

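So the delta appears to be a cookie that is sent along with the GET when I physically click the link.

The code may be verbose, but I want to learn first and optimize later. At this point it would be faster to click each link and rename the files by hand, but I want to know how this works.

Thanks for any guidance you can offer.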
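Since the difference between the two responses appears to be the session cookie, one possible approach is to copy the cookies Selenium already holds into a requests.Session before downloading. The following is a minimal sketch under that assumption, not a confirmed fix for this site. The helper name session_from_selenium_cookies is hypothetical; browser.get_cookies() is the Selenium call that returns the cookies as a list of dicts.

```python
import requests


def session_from_selenium_cookies(selenium_cookies, user_agent="firefox"):
    """Build a requests.Session that carries cookies taken from Selenium.

    `selenium_cookies` is a list of dicts shaped like the ones returned by
    browser.get_cookies(): each has at least "name" and "value" keys.
    """
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    for cookie in selenium_cookies:
        session.cookies.set(
            cookie["name"],
            cookie["value"],
            # Fall back to an empty domain / root path when Selenium omits them.
            domain=cookie.get("domain", ""),
            path=cookie.get("path", "/"),
        )
    return session
```

With the logged-in browser from the question, this would be used as session = session_from_selenium_cookies(browser.get_cookies()), followed by session.get(href, stream=True) in place of requests.get(...). Whether the server then returns application/pdf depends on the cookie being the only thing it checks.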

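Have you tried 1) capturing the conversation with Wireshark, or 2) enabling debugging in requests? I can't be sure, but this could be an automatic-redirect problem. Under "The response headers I get from this are:", what is the body of the message?

@jcopens From

res = requests.get(pdfLinks[pdfItem].get_attribute('href'), stream=True, headers={'User-Agent': 'firefox'})

the body of the message is:

I didn't know what Wireshark was until I just looked it up. It seems like overkill for this problem, but it may be worth a look. I have not tried enabling debugging in requests; I will investigate. I'm not very familiar with refresh, but the url in the body seems to indicate that the page will redirect to that new url in 0 seconds. I know requests follows actual redirects (if enabled), but offhand I don't know whether it follows a refresh.

Yes, Wireshark is rather big. However, if you do protocol work, it is easy to use and genuinely useful. I did some sneaky debugging on a similar project and was glad to have it.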
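On the refresh question: requests follows HTTP 3xx redirects, but a refresh declared in the HTML body is not an HTTP redirect, so the target has to be extracted from the page and requested separately. The sketch below assumes the small text/html response is a meta refresh tag of the usual form (the actual body is not shown above, so this is a guess). The names meta_refresh_url and _MetaRefreshParser are hypothetical, and only the standard library is used.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin


class _MetaRefreshParser(HTMLParser):
    """Remember the URL of the first <meta http-equiv="refresh"> tag seen."""

    def __init__(self):
        super().__init__()
        self.target = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.target is not None:
            return
        attrs = dict(attrs)
        if (attrs.get("http-equiv") or "").lower() != "refresh":
            return
        # The content attribute looks like "0;URL=some/path".
        match = re.search(
            r"url\s*=\s*['\"]?([^'\";]+)", attrs.get("content") or "", re.I
        )
        if match:
            self.target = match.group(1).strip()


def meta_refresh_url(html, base_url):
    """Return the absolute refresh target found in `html`, or None."""
    parser = _MetaRefreshParser()
    parser.feed(html)
    return urljoin(base_url, parser.target) if parser.target else None
```

A caller could check the Content-Type of the first response and, when it is text/html, pass res.text and res.url to meta_refresh_url and issue a second GET for the returned address, ideally through the same session so that any cookies are sent again.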