Python: Downloading PDF documents with Scrapy
I am trying to download PDF files using a spider written in Scrapy. I can fetch all the documents I need from a page, but instead of being saved as PDF files they are saved as encoded text files. The href tags I am downloading from look like this:
<a href="/utils/view?id=37a074754f8d7d7302e0a32d9b049054" target="_blank" title="Download/View Attachment_1_PandemicFlu.pdf" class="file" id="yui-gen6">Attachment_1_Pandemi...</a>
Update: with the help of the answer below, I got it working. Here is my solution. Hope it helps.
import scrapy
import requests


class fbo_spider(scrapy.Spider):
    name = "fbospider"
    start_urls = ["https://www.fbo.gov/spg/AOC/AOCPD/WashingtonDC/RFPPPA190087/listing.html"]

    def parse(self, response):
        base_url = "https://www.fbo.gov"  # base URL used to build the absolute URL from each href
        i = 1
        # XPath selecting the part of the HTML that holds the documents
        for link in response.xpath("//*[@class='pkglist']/dd/a"):
            relative_url = link.xpath(".//@href").extract_first()
            # ex: https://www.fbo.gov/utils/view?id=921ca3f6f2ae471ab579075b8dc37afb
            absolute_url = base_url + relative_url
            # fetch the PDF document using the absolute URL
            r = requests.get(absolute_url)
            # write the raw bytes in binary mode so the file is a valid PDF
            with open("file%s.pdf" % i, 'wb') as f:
                f.write(r.content)
            i += 1
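The spider above names its output file1.pdf, file2.pdf and so on. The anchor shown in the question carries the original file name in its title attribute ("Download/View Attachment_1_PandemicFlu.pdf"), so that name could be reused instead. The following is only a sketch: `filename_from_title` is a hypothetical helper, and it assumes every link's title follows the "Download/View <name>" pattern seen in that single example.

```python
import re


def filename_from_title(title, fallback="document.pdf"):
    """Derive a local file name from an anchor's title attribute.

    Assumes titles look like "Download/View <name>" (based on one example).
    """
    prefix = "Download/View "
    if not title:
        return fallback
    # Drop the assumed prefix when present, otherwise keep the whole title
    name = title[len(prefix):] if title.startswith(prefix) else title
    # Replace anything outside letters, digits, dot, underscore, hyphen
    name = re.sub(r"[^A-Za-z0-9._-]", "_", name.strip())
    return name or fallback
```

Inside the spider's loop, the title could be read with `link.xpath("./@title").extract_first()` and passed to this helper in place of the `"file%s.pdf" % i` counter.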
Use the requests library to fetch the file:
import requests


def download(url):
    print('Beginning file download with requests')
    r = requests.get(url)
    with open('some_name.pdf', 'wb') as f:
        f.write(r.content)
    # Retrieve HTTP meta-data
    print(r.status_code)
    print(r.headers['content-type'])
    print(r.encoding)


download('https://www.fbo.gov/utils/view?id=37a074754f8d7d7302e0a32d9b049054')
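The answer's code saves everything under the fixed name some_name.pdf. If the server sends a Content-Disposition header (an assumption; the source does not show the response headers for this site), the file name can be taken from it. The sketch below uses only the standard library; `filename_from_headers` is a hypothetical helper that accepts any dict-like headers object, such as `r.headers` from requests.

```python
from email.message import Message


def filename_from_headers(headers, default="some_name.pdf"):
    """Return the filename from a Content-Disposition header, if any."""
    value = headers.get("Content-Disposition")
    if not value:
        return default
    # Let the stdlib email parser handle quoting and parameter syntax
    msg = Message()
    msg["Content-Disposition"] = value
    return msg.get_filename() or default
```

With requests, this could be called as `filename_from_headers(r.headers)` before opening the output file.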
@Vishnudev In that example the program looks for ".pdf"; my problem is that the links have no extension.
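Since the URL carries no extension, one option is to decide from the response itself whether the download is a PDF. A minimal sketch, assuming the response body and Content-Type header are available (for example `r.content` and `r.headers.get('content-type')` from requests); `looks_like_pdf` is a hypothetical helper. It relies on the fact that PDF files normally begin with the bytes `%PDF-`.

```python
def looks_like_pdf(content, content_type=None):
    """Guess whether a downloaded body is a PDF.

    content: raw response bytes; content_type: optional header value.
    """
    if content_type:
        # Ignore parameters such as "; charset=utf-8"
        media_type = content_type.split(";")[0].strip().lower()
        if media_type == "application/pdf":
            return True
    # Fall back to the PDF magic bytes at the start of the body
    return content.startswith(b"%PDF-")
```

A check like this can be run before writing the file, so that an HTML error page is not saved under a .pdf name.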