Python: How do I create a custom Scrapy Item Exporter?

python json scrapy

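I'm trying to create a custom Scrapy Item Exporter based on JsonLinesItemExporter so that I can slightly alter the structure it produces.

I have read the documentation here, but it doesn't say how to create a custom exporter, where to store it, or how to link it to a pipeline.

I have figured out how to customize the feed exporters, but that doesn't suit my requirements, because I want to call this exporter from my pipeline.

Here is the code I've come up with, stored in a file named exporters.py in the project root directory: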
from scrapy.contrib.exporter import JsonLinesItemExporter

class FanItemExporter(JsonLinesItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        self.file.write("""{
            'product': [""")

    def finish_exporting(self):
        self.file.write("]}")

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(self.encoder.encode(itemdict))

I have simply tried calling it from my pipeline as FanItemExporter, along with several variations of the import, but nothing happens.
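
It is true that the Scrapy documentation does not clearly state where to place an Item Exporter. To use an Item Exporter, follow these steps:

1. Choose an Item Exporter class and import it into pipelines.py in the project directory. It can be a predefined Item Exporter (e.g. XmlItemExporter) or a user-defined one (such as the FanItemExporter defined in the question).
2. Create an Item Pipeline class in pipelines.py and instantiate the imported Item Exporter in that class. The details are explained later in this answer.
3. Register this pipeline class in the settings.py file.

Each step is explained in detail below, together with the solution to the question.
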
Step 1

If you use a predefined Item Exporter class, import it from the scrapy.exporters module.

Ex:
from scrapy.exporters import XmlItemExporter

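If you need a custom exporter, define the custom class in a file. I suggest placing the class in exporters.py and putting that file in the project folder (where settings.py and items.py reside).

When creating a new subclass, it is generally a good idea to start from BaseItemExporter. That is appropriate if we intend to change the functionality entirely. In this question, however, most of the functionality is close to that of JsonLinesItemExporter.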

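Hence, I am attaching two versions of the same Item Exporter. One version extends the BaseItemExporter class and the other extends a built-in JSON exporter class (JsonItemExporter in the code below).
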
Version 1: Extending BaseItemExporter

Since BaseItemExporter is the parent class, start_exporting(), finish_exporting() and export_item() must all be overridden to suit our needs.

from scrapy.exporters import BaseItemExporter
from scrapy.utils.serialize import ScrapyJSONEncoder
from scrapy.utils.python import to_bytes

class FanItemExporter(BaseItemExporter):

    def __init__(self, file, **kwargs):
        self._configure(kwargs, dont_fail=True)
        self.file = file
        self.encoder = ScrapyJSONEncoder(**kwargs)
        self.first_item = True

    def start_exporting(self):
        # Double quotes keep the output valid JSON
        self.file.write(b'{"product": [')

    def finish_exporting(self):
        self.file.write(b'\n]}')

    def export_item(self, item):
        if self.first_item:
            self.first_item = False
        else:
            self.file.write(b',\n')
        itemdict = dict(self._get_serialized_fields(item))
        self.file.write(to_bytes(self.encoder.encode(itemdict)))

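Version 2: Extending JsonItemExporter

JsonItemExporter already provides the same export_item() implementation as Version 1: it separates items with commas. Therefore only start_exporting() and finish_exporting() are overridden. JsonLinesItemExporter, by contrast, writes one item per line with no separator, so its export_item() would not produce a valid array here.

The implementation of the built-in exporters can be seen in python_dir\pkgs\scrapy-1.1.0-py35_0\Lib\site-packages\scrapy\exporters.py
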
from scrapy.exporters import JsonItemExporter

class FanItemExporter(JsonItemExporter):

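    def __init__(self, file, **kwargs):
        # Initialize via JsonItemExporter's constructor, forwarding any
        # exporter options instead of silently dropping them
        super().__init__(file, **kwargs)

    def start_exporting(self):
        # Double quotes keep the output valid JSON
        self.file.write(b'{"product": [')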

    def finish_exporting(self):
        self.file.write(b'\n]}')

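Note: when writing data to the file, keep in mind that the standard Item Exporter classes expect a binary file. The file must therefore be opened in binary mode (b). For the same reason, the write() calls in both versions write bytes to the file.
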
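As a rough illustration of the point about bytes, the following stdlib-only sketch mimics the framing used above (opening bytes, comma-separated items, closing bytes). It uses io.BytesIO in place of a file opened with 'wb' and json.dumps in place of ScrapyJSONEncoder. The helper name write_framed and the sample items are made up for this example, and it uses double quotes around product so that the result parses as JSON.

```python
import io
import json


def write_framed(file, items):
    """Write items as {"product": [...]} to a binary file-like object."""
    file.write(b'{"product": [')          # same role as start_exporting()
    for index, item in enumerate(items):
        if index:                          # same role as the first_item flag
            file.write(b',\n')
        # json.dumps returns str, so it is encoded before writing
        file.write(json.dumps(item).encode('utf-8'))
    file.write(b'\n]}')                    # same role as finish_exporting()


buffer = io.BytesIO()                      # stands in for open(path, 'wb')
write_framed(buffer, [{'name': 'fan A'}, {'name': 'fan B'}])
parsed = json.loads(buffer.getvalue().decode('utf-8'))
print(parsed)

# Writing str instead of bytes to a binary file-like object fails
try:
    io.BytesIO().write('{"product": [')
except TypeError as error:
    print('str write rejected:', error)
```

The TypeError at the end is the kind of failure to expect if an exporter writes plain strings to a file opened in binary mode.
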
Step 2

Create an Item Pipeline class.

from project_name.exporters import FanItemExporter

class FanExportPipeline(object):
    def __init__(self, file_name):
        # Storing output filename
        self.file_name = file_name
        # Creating a file handle and setting it to None
        self.file_handle = None

    @classmethod
    def from_crawler(cls, crawler):
        # getting the value of FILE_NAME field from settings.py
        output_file_name = crawler.settings.get('FILE_NAME')

        # cls() calls FanExportPipeline's constructor
        # Returning a FanExportPipeline object
        return cls(output_file_name)

    def open_spider(self, spider):
        print('Custom export opened')

        # Opening file in binary-write mode
        file = open(self.file_name, 'wb')
        self.file_handle = file

        # Creating a FanItemExporter object and initiating export
        self.exporter = FanItemExporter(file)
        self.exporter.start_exporting()

    def close_spider(self, spider):
        print('Custom Exporter closed')

        # Ending the export to file from FanItemExporter object
        self.exporter.finish_exporting()

        # Closing the opened output file
        self.file_handle.close()

    def process_item(self, item, spider):
        # passing the item to FanItemExporter object for exporting to file
        self.exporter.export_item(item)
        return item
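
To see the order in which the hooks run without starting a crawl, here is a small stdlib-only simulation. StubExporter and SketchPipeline are hypothetical stand-ins written for this example (they are not Scrapy classes), and the hooks are called by hand in the order Scrapy calls them: open_spider once, process_item for each item, close_spider once.

```python
import json
import os
import tempfile


class StubExporter:
    """Hypothetical stand-in for FanItemExporter; not part of Scrapy."""

    def __init__(self, file):
        self.file = file
        self.first_item = True

    def start_exporting(self):
        self.file.write(b'{"product": [')

    def export_item(self, item):
        if not self.first_item:
            self.file.write(b',\n')
        self.first_item = False
        self.file.write(json.dumps(dict(item)).encode('utf-8'))

    def finish_exporting(self):
        self.file.write(b'\n]}')


class SketchPipeline:
    """Same hook layout as FanExportPipeline, minus from_crawler."""

    def __init__(self, file_name):
        self.file_name = file_name

    def open_spider(self, spider):
        self.file_handle = open(self.file_name, 'wb')
        self.exporter = StubExporter(self.file_handle)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)
        return item  # hand the item on to the next pipeline

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file_handle.close()


# Drive the hooks by hand, in the order Scrapy would call them
path = os.path.join(tempfile.mkdtemp(), 'products.json')
pipeline = SketchPipeline(path)
pipeline.open_spider(spider=None)
for item in ({'id': 1}, {'id': 2}):
    pipeline.process_item(item, spider=None)
pipeline.close_spider(spider=None)

with open(path, 'rb') as handle:
    result = json.loads(handle.read().decode('utf-8'))
print(result)
```

The closing bytes are only written in close_spider, so the output file is not complete JSON until the spider has finished.
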

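Step 3

Now that the Item Export Pipeline is defined, register it in the settings.py file. Also add the field FILE_NAME to settings.py; this field holds the filename of the output file.

Add the following lines to settings.py: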
FILE_NAME = 'path/outputfile.ext'
ITEM_PIPELINES = {
    'project_name.pipelines.FanExportPipeline' : 600,
}

If ITEM_PIPELINES is already uncommented, add the following line to the ITEM_PIPELINES dictionary instead:
'project_name.pipelines.FanExportPipeline': 600,
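
For instance, a settings.py that already registers another pipeline might end up looking like the sketch below. CleanItemPipeline and the output path are hypothetical placeholders. Scrapy runs pipelines in ascending order of these integer values (customarily chosen in the 0-1000 range), so 600 places the export after a pipeline registered at 300. The last two lines only illustrate that ordering and do not belong in settings.py.

```python
# settings.py (fragment)
FILE_NAME = 'output/products.json'  # hypothetical output path

ITEM_PIPELINES = {
    # hypothetical pipeline that was already registered
    'project_name.pipelines.CleanItemPipeline': 300,
    # the export pipeline from Step 2
    'project_name.pipelines.FanExportPipeline': 600,
}

# Illustration only: lower values run first
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(run_order)
```
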

This is one way to create a custom Item Exporter pipeline.

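Can you show how you tried to use the exporter, and what errors or results you got? Thanks.

Hey @Alexe, so I tried calling the exporter from the pipeline, but it isn't detected. Any ideas?

Can you post your settings.py, where you configured the exporter?

How do you call it from the pipeline?

I saw a similar example in the guide, but it used a built-in exporter. Still, always going through a pipeline seems a bit redundant. Is there no way around it?

I suspect this solution is outdated (or soon will be) and may be deprecated, since it uses the crawler. It would be great to find an alternative that doesn't involve rewriting a class that already exists, for example something like exporters. (That one didn't seem to work either.)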