How to incrementally write to a JSON file in Python

python json dictionary

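I am writing a program that needs to produce a very large json file. I know the traditional way is to dump a list of dictionaries with json.dump(), but the list is too big: even total memory plus swap space cannot hold it before it is dumped. Is there any way to stream the data into a json file, i.e. to write the data to the file incrementally?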

Sadly, json.dump() has no incremental mode of its own, so by itself it cannot do what you want.

That is clearly going to be a very large file. Is there some other representation that would be more appropriate?
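One representation that might fit, assuming the consumer of the file can accept it, is newline-delimited JSON: one complete JSON document per line instead of one enclosing array. The sketch below is only an illustration; write_json_lines and read_json_lines are hypothetical helper names, and io.StringIO stands in for a real file.

```python
import io
import json


def write_json_lines(records, fh):
    """Write each record as one JSON document per line (hypothetical helper)."""
    for rec in records:
        fh.write(json.dumps(rec))
        fh.write("\n")


def read_json_lines(fh):
    """Yield records back one at a time, without loading the whole file."""
    for line in fh:
        if line.strip():
            yield json.loads(line)


# Small in-memory demonstration; a real program would pass an open file.
buf = io.StringIO()
write_json_lines(({"id": i} for i in range(3)), buf)
buf.seek(0)
records_back = list(read_json_lines(buf))
```

Because every line is a complete document, neither the writer nor the reader ever needs the whole data set in memory. The trade-off is that the file as a whole is no longer a single valid JSON value.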

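Otherwise, the best suggestion I can make is to dump each list entry to an in-memory structure and write the entries out with the necessary delimiters ("[" at the beginning, "," between entries, and "]" at the end) to construct the JSON you need.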

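If formatting is important, be aware that the wrapper text your program writes will break the correct indentation. Indentation is only for human readers, though, so it makes no difference to the semantics of the JSON structure.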
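If readable output does matter, one possible way to keep consistent indentation while still writing entry by entry is sketched below. It is an illustration rather than a definitive implementation: write_indented_array is a hypothetical helper, and it assumes each individual entry is small enough to serialize in memory.

```python
import io
import json
import textwrap


def write_indented_array(records, fh, indent=2):
    """Stream `records` into `fh` as an indented JSON array (hypothetical helper)."""
    pad = " " * indent
    fh.write("[")
    first = True
    for rec in records:
        # Serialize one entry at a time so only that entry is held in memory.
        text = json.dumps(rec, indent=indent)
        fh.write("\n" if first else ",\n")
        # Shift the entry right by one level so it sits inside the array.
        fh.write(textwrap.indent(text, pad))
        first = False
    fh.write("]" if first else "\n]")


buf = io.StringIO()  # stands in for an open file
write_indented_array(({"n": i} for i in range(2)), buf)
pretty = buf.getvalue()
```

For this small input the result matches what json.dumps() produces for the whole list with the same indent, but the entries are never collected into a list first.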

I know this is a year late, but the issue is still open, and I'm surprised iterencode() was not mentioned.

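The potential problem with iterencode in this example is that you would want to process a large data set iteratively by using a generator, and the json encoder does not serialize generators.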
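The limitation is easy to reproduce. In the minimal sketch below, numbers is a hypothetical generator standing in for a large data source; passing it straight to the encoder raises a TypeError.

```python
import json


def numbers():
    """A tiny generator standing in for a large data source."""
    for i in range(3):
        yield {"value": i}


try:
    json.dumps(numbers())
    error_message = None
except TypeError as exc:
    # The default encoder has no rule for generator objects.
    error_message = str(exc)

# Materializing the generator works, but it holds every item in memory
# at once, which is exactly what the question is trying to avoid.
as_list = json.dumps(list(numbers()))
```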

The way around this is to subclass the list type and override the __iter__ magic method so that it yields the output of your generator.

Here is an example of such a list subclass:

class StreamArray(list):
    """
    Wraps a generator in a list subclass so that the json encoder accepts it
    while still retaining the iterative nature of a generator.

    I.e. it behaves like a list without having to exhaust the generator
    and keep its contents in memory.
    """
    def __init__(self, generator):
        self.generator = generator
        # Start non-zero so the encoder does not treat the list as empty.
        self._len = 1

    def __iter__(self):
        self._len = 0
        for item in self.generator:
            yield item
            self._len += 1

    def __len__(self):
        """
        The json encoder tests whether the list is empty (which calls this
        method) before it starts iterating over it.
        """
        return self._len

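Usage from here is quite simple. Get the generator handle, pass it into the StreamArray class, pass the StreamArray object into iterencode(), and iterate over the chunks. The chunks are json-formatted output that can be written directly to the file.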

Example usage:

import json

# Function that will iteratively generate a large set of data.
def large_list_generator_func():
    for i in range(5):
        chunk = {'hello_world': i}
        print('Yielding chunk: ', chunk)
        yield chunk

# Write the contents to file:
with open('/tmp/streamed_write.json', 'w') as outfile:
    large_generator_handle = large_list_generator_func()
    stream_array = StreamArray(large_generator_handle)
    for chunk in json.JSONEncoder().iterencode(stream_array):
        print('Writing chunk: ', chunk)
        outfile.write(chunk)

Output showing that the yields and writes happen consecutively (truncated after the second record):

Yielding chunk:  {'hello_world': 0}
Writing chunk:  [
Writing chunk:  {
Writing chunk:  "hello_world"
Writing chunk:  : 
Writing chunk:  0
Writing chunk:  }
Yielding chunk:  {'hello_world': 1}
Writing chunk:  , 
Writing chunk:  {
Writing chunk:  "hello_world"
Writing chunk:  : 
Writing chunk:  1
Writing chunk:  }
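As a quick sanity check, the streamed chunks can be joined and parsed back. The sketch below repeats a condensed StreamArray so that it runs on its own; records is a hypothetical generator, and io.StringIO stands in for the output file.

```python
import io
import json


class StreamArray(list):
    """Condensed copy of the list subclass above, repeated so this runs alone."""

    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        return iter(self.generator)

    def __len__(self):
        # Report a non-zero length so the encoder does not emit "[]" early.
        return 1


def records():
    for i in range(5):
        yield {"hello_world": i}


buf = io.StringIO()  # stands in for an open file
for chunk in json.JSONEncoder().iterencode(StreamArray(records())):
    buf.write(chunk)

round_trip = json.loads(buf.getvalue())
```

One caveat, as far as I can tell from CPython's pure-Python encoder: because the length is reported as non-zero before iteration starts, a generator that yields nothing would probably produce a lone closing bracket rather than a valid empty array, so that case is worth guarding against separately.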

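Alternatively, assuming you have an iterable it that you want to write to a file handle fh as one large JSON array of records, you can do the following, which I think is the simplest approach: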

import json

def write_json_iter(it, fh):
    # Open the array, write each record separated by commas, then close it.
    print("[", file=fh)
    for n, rec in enumerate(it):
        if n > 0:
            print(",", file=fh)
        json.dump(rec, fh)
    print("]", file=fh)
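A short usage sketch for this function follows. The function is repeated so the example is self-contained; squares is a hypothetical record source, and io.StringIO stands in for a real file handle.

```python
import io
import json


def write_json_iter(it, fh):
    """Same function as above, repeated so the example is self-contained."""
    print("[", file=fh)
    for n, rec in enumerate(it):
        if n > 0:
            print(",", file=fh)
        json.dump(rec, fh)
    print("]", file=fh)


def squares(limit):
    """Hypothetical record source; any iterable of serializable items works."""
    for i in range(limit):
        yield {"n": i, "square": i * i}


buf = io.StringIO()  # swap for open("out.json", "w") in real use
write_json_iter(squares(3), buf)
parsed = json.loads(buf.getvalue())

# An empty iterable still produces a valid (empty) JSON array.
empty = io.StringIO()
write_json_iter([], empty)
```

Only one record is serialized at a time, so memory use depends on the largest record rather than on the size of the whole array.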

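You could insert a "[" at the beginning of the file and then write the dump of each value as a new element of the array. When you close the file, terminate it with a "]".

The formatting is not important. The json file will be used to index a large collection of documents into Apache Solr. I will definitely try your approach and check that it works. Thanks a lot.

Nice. I just looked at iterencode, which json.dump() uses by default. In short, the answer is: write your own dump function. Apparently it is not that hard.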