Python Scrapy: saving downloadable files
I am writing a Scrapy web crawler that saves the HTML from the pages I visit. I also want to save the crawled files with their original file extensions. This is what I have so far.

Spider class
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'my name'
    start_urls = ['my url']
    allowed_domains = ['my domain']
    rules = (Rule(LinkExtractor(allow=()), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        item = MyItem()
        item['url'] = response.url
        item['html'] = response.body
        return item
pipelines.py
import os

save_path = 'My path'
if not os.path.exists(save_path):
    os.makedirs(save_path)

class HtmlFilePipeline(object):
    def process_item(self, item, spider):
        page = item['url'].split('/')[-1]
        filename = '%s.html' % page
        with open(os.path.join(save_path, filename), 'wb') as f:
            f.write(item['html'])
        self.UploadtoS3(filename)

    def UploadtoS3(self, filename):
        ...
Is there a simple way to detect whether a link ends with a file extension and save the file under that extension? What I have at the moment saves everything as .html, regardless of the actual extension.
I suppose I could remove
filename = '%s.html' % page
so that each file would be saved under its own extension, but there are cases where I still want to save as HTML, for example when the URL ends with aspx. Try this:
import os

extension = os.path.splitext(url)[-1].lower()
# check if URL has GET request parameters and remove them (page.html?render=true)
if '?' in extension:
    extension = extension.split('?')[0]
You may also want to check whether this returns an empty string, e.g. '' when the URL does not end with an extension at all.
if '.' not in page:
    fileName = '%s.html' % page
else:
    fileName = page
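Putting the two checks together, here is a minimal sketch of a filename picker. The helper name `choose_filename` and the `index` fallback for trailing-slash URLs are my own assumptions, not part of the original code; `urllib.parse.urlsplit` (stdlib) is used to drop the query string up front instead of splitting on `?` by hand.

```python
import os
from urllib.parse import urlsplit

def choose_filename(url):
    """Pick a save filename for a crawled URL: keep a real extension,
    fall back to .html when there is none (or when it is .aspx)."""
    path = urlsplit(url).path              # drops ?query and #fragment
    page = path.split('/')[-1] or 'index'  # assumed fallback for trailing-slash URLs
    extension = os.path.splitext(page)[-1].lower()
    if extension in ('', '.aspx'):         # no extension, or a server-side page
        return '%s.html' % page
    return page

print(choose_filename('http://example.com/report.pdf'))      # report.pdf
print(choose_filename('http://example.com/page.aspx?id=3'))  # page.aspx.html
print(choose_filename('http://example.com/docs/'))           # index.html
```

The pipeline's `process_item` could then call this on `item['url']` instead of hard-coding the `.html` suffix.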
What does the `'?'` check look for? It checks whether the URL has GET request parameters and removes them. Example:
http://google.com/page.html?render=true
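Using that sample URL, a quick check shows why the `'?'` strip is needed, and an alternative: `os.path.splitext` on the raw URL keeps the query string glued to the extension, while `urllib.parse.urlsplit` (stdlib) drops it before splitting.

```python
import os
from urllib.parse import urlsplit

url = 'http://google.com/page.html?render=true'

# splitext alone keeps the query string attached to the extension:
raw = os.path.splitext(url)[-1].lower()
print(raw)                 # .html?render=true

# stripping at '?' (as the answer does) recovers the real extension:
print(raw.split('?')[0])   # .html

# urlsplit drops the query up front, which avoids the manual check:
print(os.path.splitext(urlsplit(url).path)[-1].lower())  # .html
```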