Python Scrapy: a good way to split exported files?
I would like to split the file exported by JsonLinesItemExporter into multiple files whenever the spider has parsed a certain number of items (MAX_ITEMS). The code below is a working solution, but I would like some input on the approach. I am afraid that explicitly calling spider_opened() and spider_closed() to close the old file and open a new one could cause problems. Any ideas or best practices would be appreciated :)
Tags: python, scrapy, web-crawler, screen-scraping
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/topics/item-pipeline.html
from scrapy import signals
from scrapy.exceptions import DropItem
from scrapy.exporters import JsonLinesItemExporter

# Number of unique items written to each output file
MAX_ITEMS = 10000
class DmozPipeline(object):
    def process_item(self, item, spider):
        return item
class JsonLinePipeline(object):
    def __init__(self):
        self.files = {}
        self.ids_seen = set()
        self.fileid = 0
        self.filetype = ".json"

    @classmethod
    def from_crawler(cls, crawler):
        pipeline = cls()
        crawler.signals.connect(pipeline.spider_opened, signals.spider_opened)
        crawler.signals.connect(pipeline.spider_closed, signals.spider_closed)
        return pipeline

    def spider_opened(self, spider):
        # Open the output file for the current fileid and attach an exporter to it.
        file = open("items-" + str(self.fileid) + self.filetype, 'w+b')
        self.files[spider] = file
        self.exporter = JsonLinesItemExporter(file)
        self.exporter.start_exporting()

    def spider_closed(self, spider):
        # Finish the current export and close its file.
        self.exporter.finish_exporting()
        file = self.files.pop(spider)
        file.close()

    def process_item(self, item, spider):
        link = item['link'][0]
        # Drop duplicates first so that they can never trigger a file rotation.
        if link in self.ids_seen:
            raise DropItem("Duplicate site found: %s" % item)
        # Start a new file once the current one holds MAX_ITEMS unique items.
        exported = len(self.ids_seen)
        if exported > 0 and exported % MAX_ITEMS == 0:
            self.spider_closed(spider)
            self.fileid += 1
            self.spider_opened(spider)
        self.ids_seen.add(link)
        self.exporter.export_item(item)
        return item
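
One way to avoid calling the signal handlers by hand is to keep the "close the old file, open the next one" logic in its own small object. The sketch below is only an illustration of that idea using the standard library; `RotatingJsonLinesWriter` and `write_demo` are hypothetical names that are not part of Scrapy, and the sketch writes with `json.dumps` rather than `JsonLinesItemExporter`.

```python
import json
import os
import tempfile


class RotatingJsonLinesWriter:
    """Hypothetical helper: writes JSON lines, starting a new file every max_items records."""

    def __init__(self, directory, max_items, prefix="items-", suffix=".json"):
        if max_items < 1:
            raise ValueError("max_items must be at least 1")
        self.directory = directory
        self.max_items = max_items
        self.prefix = prefix
        self.suffix = suffix
        self.file_id = 0
        self.count_in_file = 0
        self.paths = []      # every file created so far, in order
        self._file = None

    def _open(self):
        name = "%s%d%s" % (self.prefix, self.file_id, self.suffix)
        path = os.path.join(self.directory, name)
        self._file = open(path, "w", encoding="utf-8")
        self.paths.append(path)
        self.count_in_file = 0

    def write(self, record):
        if self._file is None:
            # Open lazily so that no empty file is created when nothing is written.
            self._open()
        elif self.count_in_file >= self.max_items:
            # The current file is full: close it and move on to the next id.
            self.close()
            self.file_id += 1
            self._open()
        self._file.write(json.dumps(record) + "\n")
        self.count_in_file += 1

    def close(self):
        if self._file is not None:
            self._file.close()
            self._file = None


def write_demo(n_items, max_items):
    """Write n_items records into a temporary directory; return the line count of each file."""
    directory = tempfile.mkdtemp()
    writer = RotatingJsonLinesWriter(directory, max_items)
    for i in range(n_items):
        writer.write({"link": "http://example.com/%d" % i})
    writer.close()
    counts = []
    for path in writer.paths:
        with open(path, encoding="utf-8") as handle:
            counts.append(sum(1 for _ in handle))
    return counts
```

In a pipeline, `write()` would be called from `process_item` after the duplicate check, and `close()` from the `spider_closed` handler, so the signal handlers are only ever invoked by Scrapy itself.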