MemoryError when reading and writing a 40 GB CSV in Python... where is my leak?

python python-3.x pandas memory


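I have a 40 GB CSV file that I have to write out again as several CSVs with different subsets of columns, while checking that the data contains no NaNs. I chose Pandas for this; a minimal example of my implementation (inside the function output_different_formats) is shown in the listing further down that builds scen_iter.

Why do I get a MemoryError there? Is there a memory leak in my program, or am I misunderstanding something? Or could the program simply be failing at random while writing one particular chunk to CSV, in which case I should consider reducing the chunk size? The full traceback is reproduced further down, after the code.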


My current workaround is to call gc.collect() inside the loop; the listing further down that repeats the while loop with the gc.collect() calls added shows exactly where.

After adding these lines, memory consumption stays stable, but it is still unclear to me why the original approach runs into memory problems.

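Maybe you can try calling the garbage collector (gc.collect()) inside the loop. As a workaround, you could also try a 64-bit version of Python.

@Jean-FrançoisFabre I'm trying gc.collect() now, but I won't know for the next few hours whether it worked. Why would 64-bit Python help?

64-bit Python allows much larger memory allocations (of course, you need the physical memory/swap on the system and a 64-bit Windows). This would not fix a memory leak, but it would delay it, hopefully until the program terminates.

@Jean-FrançoisFabre I see. If gc.collect() solves the problem right away I'll let you know. Thanks for the help!

This does not look like a memory leak, because in that case calling the garbage collector would not help you. The memory allocation is most likely done implicitly inside the various libraries you use. As a library user, I would be surprised if there were anything you could do about it. I'd be curious to know what a library user is supposed to do in your situation. :) But I can't imagine that you caused a memory leak here, and neither did your libraries.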
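Regarding the 64-bit suggestion in the comments above: a quick way to see which kind of interpreter is running is to look at its pointer size. This is a minimal sketch using only the standard library; the helper name interpreter_bits is made up for illustration.

```python
import struct
import sys


def interpreter_bits():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *).
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print(f"Running a {interpreter_bits()}-bit Python")
    # sys.maxsize is 2**63 - 1 on a 64-bit build and 2**31 - 1 on a 32-bit build.
    print("sys.maxsize > 2**32:", sys.maxsize > 2**32)
```

If this reports 32, a single process cannot address more than a few gigabytes no matter how much RAM the machine has, which is the point the comment makes.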

import pandas as pd

# column_names is a huge list containing the union of the columns of all the
#  output column subsets
scen_iter = pd.read_csv('mybigcsv.csv', header=0, index_col=False,
                        iterator=True, na_filter=False,
                        usecols=column_names)
CHUNKSIZE = 630100
scen_cnt = 0
output_names = ['formatA', 'formatB', 'formatC', 'formatD', 'formatE']
# column_mappings is a dictionary mapping the output names to their
#  respective column subsets.
while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        pass  # some error handling (has yet to ever occur)
    for item in output_names:
        # header=True with mode='a' writes the header row again for every chunk
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')

    scen_cnt += 100

  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
    out = np.empty(out_shape, dtype=dtype)
MemoryError

Traceback (most recent call last):
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 128, in <module>
    output_different_formats(real_world=False)
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 50, in clocked
    result = func(*args, **kwargs)
  File "D:/AppData/A/MRM/Eric/output_formats.py", line 116, in output_different_formats
    mode='a', header=True, index=False, compression='gzip')
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 1188, in to_csv
    decimal=decimal)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\format.py", line 1293, in __init__
    self.obj = self.obj.loc[:, cols]
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1187, in __getitem__
    return self._getitem_tuple(key)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 720, in _getitem_tuple
    retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 1323, in _getitem_axis
    return self._getitem_iterable(key, axis=axis)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\indexing.py", line 966, in _getitem_iterable
    result = self.obj.reindex_axis(keyarr, axis=axis, level=level)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\frame.py", line 2519, in reindex_axis
    fill_value=fill_value)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1852, in reindex_axis
    {axis: [new_index, indexer]}, fill_value=fill_value, copy=copy)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\generic.py", line 1876, in _reindex_with_indexers
    copy=copy)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3157, in reindex_indexer
    indexer, fill_tuple=(fill_value,))
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 3238, in _slice_take_blocks_ax0
    new_mgr_locs=mgr_locs, fill_tuple=None))
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\internals.py", line 853, in take_nd
    allow_fill=False)
  File "D:\AppData\A\MRM\Eric\Anaconda\lib\site-packages\pandas\core\common.py", line 838, in take_nd
    out = np.empty(out_shape, dtype=dtype)
MemoryError
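The traceback above ends in np.empty inside take_nd, reached from self.obj.loc[:, cols] in to_csv. In other words, selecting a column subset allocates a fresh array for that subset on top of the chunk that is already in memory. A rough back-of-the-envelope estimate of those sizes might look like the sketch below. It assumes every column is float64 (8 bytes per value), and the column counts are invented for illustration, since the question does not say how many columns there are.

```python
CHUNKSIZE = 630100          # rows per chunk, taken from the question
BYTES_PER_FLOAT64 = 8


def dense_size_gib(n_rows, n_cols, itemsize=BYTES_PER_FLOAT64):
    """Size in GiB of a dense n_rows x n_cols array with the given item size."""
    return n_rows * n_cols * itemsize / 2**30


# Hypothetical column counts, for illustration only.
n_union_cols = 500          # columns read via usecols
n_subset_cols = 200         # columns in one output format

chunk_gib = dense_size_gib(CHUNKSIZE, n_union_cols)
copy_gib = dense_size_gib(CHUNKSIZE, n_subset_cols)
print(f"chunk: {chunk_gib:.2f} GiB, subset copy: {copy_gib:.2f} GiB")
```

For a real chunk, scenario.memory_usage(deep=True).sum() reports the actual size in bytes. If the numbers come out anywhere near the memory available to the process, reducing CHUNKSIZE is the obvious first thing to try.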

import gc

while scen_cnt < 10000:
    scenario = scen_iter.get_chunk(CHUNKSIZE)
    if scenario.isnull().values.any():
        pass  # some error handling (has yet to ever occur)
    for item in output_names:
        scenario.to_csv(item, float_format='%.8f',
                        columns=column_mappings[item],
                        mode='a', header=True, index=False, compression='gzip')
        gc.collect()  # collect after every output file
    gc.collect()  # and once more after every chunk
    scen_cnt += 100
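One side effect of the loop above is that header=True combined with mode='a' writes the header row again for every chunk. The following is a minimal sketch of a variant that writes the header once per output and keeps a single gzip handle open per output file. It assumes a reasonably recent pandas; the function name split_csv and its parameters are made up for illustration and are not the asker's code. It also leaves na_filter at its default, because na_filter=False switches off missing-value detection during parsing and the isnull() check may then never fire.

```python
import gzip

import pandas as pd


def split_csv(source, column_mappings, chunksize):
    """Stream `source` in chunks and write each column subset to its own gzip CSV.

    `column_mappings` maps an output path to the list of columns it should get.
    """
    # Union of all requested columns, so that only those are parsed.
    wanted = sorted({col for cols in column_mappings.values() for col in cols})
    # One text-mode gzip handle per output, opened once for the whole run.
    handles = {path: gzip.open(path, "wt", newline="") for path in column_mappings}
    try:
        reader = pd.read_csv(source, usecols=wanted, chunksize=chunksize)
        for i, chunk in enumerate(reader):
            if chunk.isnull().values.any():
                raise ValueError(f"NaN found in chunk {i}")
            for path, cols in column_mappings.items():
                # Only the first chunk carries the header row.
                chunk.to_csv(handles[path], columns=cols, float_format="%.8f",
                             header=(i == 0), index=False)
    finally:
        for handle in handles.values():
            handle.close()
```

A call would look like split_csv('mybigcsv.csv', column_mappings, chunksize=100000), where the chunk size is again only an example value. This does not by itself explain the MemoryError; it only removes the repeated headers and the repeated reopening of the output files.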