Downloading PDFs with Python


I am trying to download several PDF files that sit behind different hyperlinks on a single URL. My approach is to first collect the URLs whose href contains the text "fileEntryId" (these lead to the PDFs), and then try to download the PDF files from them.

This is "my" code so far:

import httplib2
from bs4 import BeautifulSoup, SoupStrainer
import re
import os
import requests
from urllib.parse import urljoin


http = httplib2.Http()
status, response = http.request('https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015')

# If there is no such folder, the script will create one automatically
folder_location = r'c:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)

for link in BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a', href=re.compile('fileEntryId'))):
    if link.has_attr('href'):
        x = link['href']
        response = requests.get(x)
        soup = BeautifulSoup(response.text, "html.parser")
        for link in soup.select("x"):
            # Name the pdf files using the last portion of each link, which is unique in this case
            filename = os.path.join(folder_location, link['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get(urljoin(x, link['href'])).content)

Thanks

Create a folder anywhere and put the script in it. When you run the script, you should find the downloaded pdf files in that folder. If for some reason the script doesn't work for you, make sure your bs4 version is up to date, as I've used pseudo css selectors to target the required links.

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/resultados/proceso-auditor/auditorias-liberadas/regalias/auditorias-regalias-liberadas-ano-2015'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text, "lxml")
    # Collect the links to the individual audit pages
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='fileEntryId']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text, "lxml")
        # Each audit page has a "Descargar" (download) link pointing at the PDF
        pdf_link = soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        file_name = pdf_link.split("/")[-1].split("?")[0]
        with open(f"{file_name}.pdf", "wb") as f:
            f.write(s.get(pdf_link).content)
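The answer derives each local file name from the last path segment of the download URL, dropping the query string. If you want to reuse that step, it can be factored into a small helper (the function name `pdf_filename` is my own, not part of the answer above):

```python
import os
from urllib.parse import urlparse

def pdf_filename(url, folder="."):
    """Build a local .pdf path from the last path segment of a URL,
    dropping any query string (same idea as split("/")[-1].split("?")[0])."""
    name = os.path.basename(urlparse(url).path)
    if not name.lower().endswith(".pdf"):
        name += ".pdf"
    return os.path.join(folder, name)
```

Using `urlparse` instead of string splitting also handles URLs with fragments or empty query strings cleanly.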

The PDFs are not "embedded". Have you looked at the source of the pages you are fetching? You are searching for `x` tags, and I don't think there are any `x` tags. The pages are complex.

Thanks for the feedback; I have edited the question to steer away from wrong solutions.
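As the comment points out, the PDFs on these pages are not embedded in the HTML, so it is worth verifying that a downloaded response is really a PDF and not an HTML error page. Every PDF file begins with the magic bytes `%PDF-`, which gives a cheap check; the helper below is a sketch of my own, not part of the discussion above:

```python
def looks_like_pdf(first_bytes):
    """Return True if the byte string starts with the PDF magic bytes %PDF-."""
    return first_bytes.startswith(b"%PDF-")
```

You could call it as `looks_like_pdf(s.get(pdf_link).content[:5])` before writing the file, and skip anything that fails the check.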