
Python pyarrow memory leak?


While parsing a larger file, I need to write to a large number of Parquet files sequentially in a loop. However, the memory this task consumes seems to grow with every iteration, whereas I would expect it to stay constant (since nothing should be accumulating in memory). This makes it hard to scale.

I have added a minimal reproducible example that creates 10,000 Parquet files and appends to them in a loop.

import resource
import random
import string
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


def id_generator(size=6, chars=string.ascii_uppercase + string.digits):
    return ''.join(random.choice(chars) for _ in range(size))

schema = pa.schema([
                        pa.field('test', pa.string()),
                    ])

resource.setrlimit(resource.RLIMIT_NOFILE, (1000000, 1000000))
number_files = 10000
number_rows_increment = 1000
number_iterations = 100

writers = [pq.ParquetWriter('test_'+id_generator()+'.parquet', schema) for i in range(number_files)]

for i in range(number_iterations):
    for writer in writers:
        table_to_write = pa.Table.from_pandas(
            pd.DataFrame({'test': [id_generator() for i in range(number_rows_increment)]}),
            preserve_index=False,
            schema=schema,
            nthreads=1)
        table_to_write = table_to_write.replace_schema_metadata(None)
        writer.write_table(table_to_write)
    print(i)

for writer in writers:
    writer.close()
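
As a side note (not part of the original snippet), here is a minimal sketch of how the per-iteration growth could be quantified. It reuses the resource and pyarrow imports from the example above; the report_memory helper name is made up for illustration, ru_maxrss is reported in kilobytes on Linux and bytes on macOS, and pa.total_allocated_bytes() only tracks Arrow's own memory pool.

def report_memory(label):
    # Peak resident set size of this process so far
    # (kilobytes on Linux, bytes on macOS).
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Bytes currently held by pyarrow's default memory pool
    # (Arrow allocations only, not pandas or writer-side buffers).
    arrow_bytes = pa.total_allocated_bytes()
    print(label, '| ru_maxrss:', rss, '| arrow pool:', arrow_bytes, 'bytes')

# Calling report_memory('iteration %d' % i) instead of print(i) in the outer loop
# shows whether the resident set keeps climbing while the Arrow pool stays flat,
# which would point at growth outside Arrow's pool (pandas, writer state, etc.).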

Does anyone know what is causing this leak and how to prevent it?

We're not sure what the problem is, but some other users have reported as-yet-undiagnosed memory leaks. I have added your example to a tracking JIRA issue.

Can you say which pandas version you are using?

pandas: 0.22.0, PyArrow: 0.10.0

Please update to pandas >= 0.23. There is a leak in pandas that also affects pyarrow.

I tried that, but the memory leak is the same.

Updating to pyarrow==0.15.0 helped in my case.
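
For completeness, a quick way to confirm which versions the running interpreter actually loads before and after upgrading (a generic sketch, not taken from this thread):

import pandas as pd
import pyarrow as pa

# The upgrade advice above only applies if old versions are still being picked up.
print('pandas :', pd.__version__)
print('pyarrow:', pa.__version__)

# Upgrading from a shell, pinning the versions mentioned above:
#   pip install --upgrade "pandas>=0.23" "pyarrow>=0.15.0"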