Saving downloadable files with Python Scrapy


I am writing a Scrapy web crawler that saves the HTML of the pages I visit. I would also like to save the files I crawl with their own file extensions.

This is what I have so far. The spider class:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'my name'
    start_urls = ['my url']
    allowed_domains = ['my domain']
    rules = (
        Rule(LinkExtractor(allow=()), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        item['html'] = response.body
        return item
pipelines.py:

import os

save_path = 'My path'

if not os.path.exists(save_path):
    os.makedirs(save_path)

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        with open(os.path.join(save_path, filename), 'wb') as f:
            f.write(item['html'])
        self.UploadtoS3(filename)
        return item

    def UploadtoS3(self, filename):
        ...
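For completeness, Scrapy only runs a pipeline that is enabled in the project's settings.py; a minimal sketch, assuming a hypothetical module path (myproject.pipelines is a placeholder, not from the original code):

```python
# settings.py
# 'myproject.pipelines' is a placeholder; use your project's actual module path
ITEM_PIPELINES = {
    'myproject.pipelines.HtmlFilePipeline': 300,
}
```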
Is there an easy way to detect whether the link ends in a file extension and save to that extension? What I currently have saves everything to .html regardless of the extension.

I think I could remove

filename = '%s.html' % page
and it would save with its own extension, but there are cases where I want to save as html anyway, for example if the URL ends in aspx.

Try this:

import os

extension = os.path.splitext(url)[-1].lower()
# check if the URL has GET request parameters and remove them (page.html?render=true)
if '?' in extension:
    extension = extension.split('?')[0]
You probably also want to check whether the extension comes back empty, for example when the URL's last segment has no "." at all:

if '.' not in page:
    fileName = '%s.html' % page
else:
    fileName = page
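Putting the pieces above together, here is a self-contained sketch of the filename logic (the choose_filename helper and the rule that .aspx pages are stored as .html are assumptions for illustration, not part of the original code):

```python
import os
from urllib.parse import urlparse

def choose_filename(url):
    """Pick an on-disk filename for a crawled URL (hypothetical helper)."""
    # urlparse separates the query string, so '?render=true' can
    # never end up inside the extension
    page = urlparse(url).path.split('/')[-1] or 'index'
    root, extension = os.path.splitext(page)
    # no extension at all, or a dynamic-page extension we want stored as html
    if extension.lower() in ('', '.aspx'):
        return (root or page) + '.html'
    return page
```

With this, `choose_filename('http://google.com/page.html?render=true')` gives `'page.html'`, an extensionless URL falls back to `.html`, and other extensions (e.g. `.pdf`) are kept as-is.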

What does the `if '?'` check look for? It checks whether the URL has GET request parameters and strips them off. Example:
http://google.com/page.html?render=true
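As an alternative to splitting on '?' by hand, the standard library's urlparse separates the query string (and fragment) for you; a small sketch:

```python
import os
from urllib.parse import urlparse

url = 'http://google.com/page.html?render=true'
path = urlparse(url).path              # '/page.html': the query string is already gone
extension = os.path.splitext(path)[-1].lower()
print(extension)                       # prints: .html
```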