Python: Downloading PDF documents with Scrapy

python pdf web-scraping scrapy

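I am trying to download PDF files with a spider written in Scrapy. I can fetch every document I need from a page, but instead of being saved as PDF files they end up saved as encoded text files.

The anchor tag whose href I am downloading from looks like this: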

<a href="/utils/view?id=37a074754f8d7d7302e0a32d9b049054" target="_blank" title="Download/View Attachment_1_PandemicFlu.pdf" class="file" id="yui-gen6">Attachment_1_Pandemi...</a>

Update: with the help of the answer below, I got it working. Here is my solution; I hope it helps.

import scrapy
import requests


class fbo_spider(scrapy.Spider):
    name = "fbospider"

    start_urls = ["https://www.fbo.gov/spg/AOC/AOCPD/WashingtonDC/RFPPPA190087/listing.html"]

    def parse(self, response):
        # XPath selecting the hrefs in the part of the HTML that holds the documents
        hrefs = response.xpath("//*[@class='pkglist']/dd/a/@href").extract()

        for i, relative_url in enumerate(hrefs, start=1):
            # Resolve the relative href against the page URL,
            # ex: https://www.fbo.gov/utils/view?id=921ca3f6f2ae471ab579075b8dc37afb
            absolute_url = response.urljoin(relative_url)

            # Fetch the document with requests and write the raw bytes
            # in binary mode ('wb') so the PDF is not saved as text
            r = requests.get(absolute_url)
            with open("file%s.pdf" % i, 'wb') as f:
                f.write(r.content)
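
The solution above names its output file1.pdf, file2.pdf and so on. The anchor tag shown in the question carries the original file name in its title attribute ("Download/View Attachment_1_PandemicFlu.pdf"), so a rough sketch for reusing that name could look like the following. The helper `filename_from_title` and the constant `TITLE_PREFIX` are hypothetical names introduced here, and the sketch assumes every title follows the "Download/View <name>" pattern seen in that one example.

```python
import re

# Assumption: titles look like "Download/View Attachment_1_PandemicFlu.pdf",
# as in the single anchor tag quoted in the question.
TITLE_PREFIX = "Download/View "


def filename_from_title(title, fallback):
    """Return a file name taken from the link title, or the fallback."""
    if not title or not title.startswith(TITLE_PREFIX):
        return fallback
    name = title[len(TITLE_PREFIX):].strip()
    # Replace anything other than letters, digits, dot, dash and underscore,
    # then drop leading dots so names like ".." cannot be produced.
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name).lstrip(".")
    return name or fallback
```

Inside the spider's loop, the title could be read with `link.xpath("./@title").extract_first()` and passed in together with the numbered name as the fallback.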

Use the requests library to fetch the file:

import requests

def download(url):
    print('Beginning file download with requests')

    r = requests.get(url)

    with open('some_name.pdf', 'wb') as f:
        f.write(r.content)

    # Retrieve HTTP meta-data
    print(r.status_code)
    print(r.headers['content-type'])
    print(r.encoding)

download('https://www.fbo.gov/utils/view?id=37a074754f8d7d7302e0a32d9b049054')
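
The function above always writes to some_name.pdf. Servers often advertise a file name in the Content-Disposition response header; whether this particular site does so is an assumption that has not been checked here. A minimal sketch for reading that name, with `filename_from_headers` being a hypothetical helper, might be:

```python
import os
from email.message import Message


def filename_from_headers(headers, fallback="download.pdf"):
    """Return the file name advertised in Content-Disposition, or the fallback."""
    # Works with a plain dict or with the case-insensitive headers of requests.
    value = headers.get("Content-Disposition")
    if not value:
        return fallback
    # Let the standard library parse the header parameters.
    msg = Message()
    msg["Content-Disposition"] = value
    # Keep only the last path component of whatever the server sent.
    name = os.path.basename(msg.get_filename() or "")
    return name or fallback
```

With the `r` from the answer, `filename_from_headers(r.headers)` would give the name to pass to `open(...)`, falling back to a fixed name when the header is absent.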

@Vishnudev In that example the program looks for ".pdf"; my problem is that the links have no extension.
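
When a link has no extension, the response itself can be inspected instead of the URL. The sketch below is one possible check, not something taken from the answer: `looks_like_pdf` and `is_pdf_content_type` are hypothetical helper names. It relies on two facts: PDF files begin with the marker `%PDF-`, and the registered media type for PDF is `application/pdf`.

```python
def looks_like_pdf(content):
    """Return True if the downloaded bytes start with the PDF header marker."""
    return content.startswith(b"%PDF-")


def is_pdf_content_type(content_type):
    """Return True if a Content-Type header value names application/pdf."""
    if not content_type:
        return False
    # Drop parameters such as "; charset=..." before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == "application/pdf"
```

With the `r` from the answer, `looks_like_pdf(r.content)` and `is_pdf_content_type(r.headers.get("content-type"))` could decide whether to save the body with a .pdf extension.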
