Python 3.x: Downloading PDFs with Python, Pt. 2

python-3.x pdf web-scraping

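I'm trying to download several PDF files that sit behind different hyperlinks on a single URL. I've already asked a similar question, but this URL has a different structure. The URLs that lead to the PDFs contain the text "p_p_col_count%3D", which I use in the code, but for some reason it doesn't work.

There is also another solution, but there the web page has (in my opinion) clean, well-structured HTML, whereas the page I'm trying to scrape has 12 lines of crammed code. Moreover, on that solution's page the PDFs can be downloaded from a single link, whereas in my case you need to identify the right URL first and then download them.

This is "my" code so far: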

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/web/guest/resultados/proceso-auditor/auditorias-liberadas/sector-infraestructura-fisica-y-telecomunicaciones-comercio-exterior-y-desarrollo-regional/auditorias-liberadas-infraestructura-2019'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,"lxml")
    for item in soup.select("table.table > tbody.table-data td.first > a[href*='p_p_col_count%3D']"):
        inner_link = item.get("href")
        resp = s.get(inner_link)
        soup = BeautifulSoup(resp.text,"lxml")
        pdf_link = soup.select_one("a.taglib-icon:contains('Descargar')").get("href")
        file_name = pdf_link.split("/")[-2].split("/")[-1]
        with open(f"{file_name}.pdf","wb") as f:
            f.write(s.get(pdf_link).content)

Best regards.

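There are some problems with your CSS selectors, and there is also room to improve the file name handling, since the names are not very consistent.

Here's what you can try: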

import re
from urllib.parse import unquote

import requests
from bs4 import BeautifulSoup

link = 'https://www.contraloria.gov.co/web/guest/resultados/proceso-auditor/auditorias-liberadas/sector-infraestructura-fisica-y-telecomunicaciones-comercio-exterior-y-desarrollo-regional/auditorias-liberadas-infraestructura-2019'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    soup = BeautifulSoup(s.get(link).text, "lxml")
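    # Collect the detail-page URL for each entry on the listing page.
    follow_links = [
        a["href"] for a
        in soup.select(".aui .asset-abstract .asset-content .asset-more a")
    ]

    for follow_link in follow_links:
        soup = BeautifulSoup(s.get(follow_link).text, "lxml")
        # Locate the download anchor on the detail page.
        pdf_link = soup.select_one(
            ".aui .view .lfr-asset-column-details .download-document a"
        ).get("href")
        pdf_response = s.get(pdf_link)
        # The server-supplied file name is in the Content-Disposition header.
        pdf_name = pdf_response.headers["Content-Disposition"]
        # Drop everything up to the first three-digit run, URL-decode the rest,
        # join the words with underscores, and strip double quotes.
        file_name = "_".join(
            unquote(
                re.split(r"\d{3}", pdf_name, maxsplit=1)[-1]
            ).split()
        ).replace('"', "")
        print(f"Fetching {file_name}")
        with open(file_name, "wb") as f:
            f.write(pdf_response.content)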

Output:

Fetching Actuación_Especial_Contrato_de_Concesión_del_Aeropuerto_El_Dorado.pdf
Fetching ACTUACION_ESPECIAL_DE_FISCALIZACI+ôN_FONDO_DE_ADAPTACION-PUENTE_HISGAURA_MALAGA_LOS_CUROS.pdf
Fetching ACTUACION_ESPECIAL_DE_CONTROL_FISCAL_SERVICIOS_POSTALES_NACIONALES_S.A._472.pdf
Fetching Actuación_Especial_de_Control_Fiscal_-Convenios_suscritos_por_la_Agencia_Nacional_Inmobiliaria_Virgilio_Barco_Vargas.pdf
Fetching Cumplimiento_Superintendencia_de_Transporte.pdf
Fetching Cumplimiento_ANI-_Corredor_Vial_Bogota-Villavicencio.pdf
Fetching Financiera_Cámara_de_Comercio_de_Armenia_y_del_Quind+¡o.pdf
...
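
The answer builds the file name by splitting the raw `Content-Disposition` header on a three-digit run. As an alternative sketch, the header can be parsed with the standard library's `email.message.Message`, assuming the server sends a conventional `attachment; filename="..."` value. The helper name `filename_from_header`, the `default` fallback and the sample header are hypothetical and not taken from the site.

```python
import re
from email.message import Message


def filename_from_header(content_disposition, default="download.pdf"):
    """Extract a tidy file name from a Content-Disposition header value."""
    msg = Message()
    msg["Content-Disposition"] = content_disposition
    # get_filename() reads the "filename" parameter and removes the quotes.
    name = msg.get_filename()
    if not name:
        return default
    # Drop a leading numeric prefix such as "003 " (an assumption about the
    # naming scheme), then join the remaining words with underscores.
    name = re.sub(r"^\d+\s*", "", name)
    return "_".join(name.split())


# Hypothetical header value, for illustration only.
header = 'attachment; filename="003 Cumplimiento Superintendencia de Transporte.pdf"'
print(filename_from_header(header))
```

Inside the loop this would be called as `filename_from_header(pdf_response.headers.get("Content-Disposition", ""))`, which also avoids a `KeyError` when the header is missing.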

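Does this answer your question?

Hi, I swapped in my URL and it didn't download the documents; the HTML structure is completely different. I'm not an HTML expert, but compared with the page I'm working on, the UK Companies House page is well written.

Hello, thank you for your help. I managed to download the first two PDFs, but then I get an error at

--> 26 with open(file_name, "wb") as f:

for this document:

OSError: [Errno 22] Invalid argument: 'ACTUACION_ESPECIAL_DE_CONTROL_FISCAL_SERVICIOS_POSTALES_NACIONALES_S.A."_472.pdf'

I know you managed to download the documents, so it may be an issue on my side.

That's probably related to the " in the file name. Those file names are badly formatted. I've updated the answer, please try it now.

Thank you. I managed to download the documents. As for the web page, I guess its HTML code is a mess too.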
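
The `OSError: [Errno 22]` discussed in the comments is what Windows raises when a file name contains a character it forbids, such as a double quote. A minimal sketch of a more general cleanup than removing `"` alone is shown here; the helper name `safe_filename` and the `"unnamed"` fallback are illustrative assumptions, not part of the original answer.

```python
import re

# Characters Windows does not allow in file names, plus ASCII control characters.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def safe_filename(name, replacement=""):
    """Return a copy of name with characters Windows rejects removed."""
    cleaned = _INVALID_CHARS.sub(replacement, name)
    # Windows also rejects names that end in a space or a dot.
    cleaned = cleaned.rstrip(" .")
    return cleaned or "unnamed"


print(safe_filename('SERVICIOS_POSTALES_NACIONALES_S.A."_472.pdf'))
```

Passing the computed name through `safe_filename` before calling `open(file_name, "wb")` should keep the script portable between Windows and Unix-like systems.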