Python MemoryError when trying to build a huge DataFrame from read_csv


python pandas


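I have a large dataset of about 109 GB, split across 245 .vsc.gz files. I tried to open each file and then concatenate them all. The first step went fine and gave me what I wanted, but when I tried to concatenate the frames I got the following error: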

Traceback (most recent call last):
  File "CleanDataVR2.py", line 92, in <module>
    df_concated = pd.concat(dl)
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/tools/merge.py", line 835, in concat
    return op.get_result()
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/tools/merge.py", line 1025, in get_result
    concat_axis=self.axis, copy=self.copy)
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/core/internals.py", line 4474, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/core/internals.py", line 4579, in concatenate_join_units
    concat_values = com._concat_compat(to_concat, axis=concat_axis)
  File "/share/apps/anaconda/2/2.5.0/lib/python2.7/site-packages/pandas/core/common.py", line 2741, in _concat_compat
    return np.concatenate(to_concat, axis=axis)
MemoryError

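The function volume_filter is just a function I wrote to produce the column named Tage. Since that part works fine, I have not included it here. Each DataFrame has the following structure: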

            #RIC      Date[G]       Time[G]    Price  Volume   Tage
0         URKAq.L  01-SEP-2008  11:19:46.800   45.000   152.0   T200
8         URKAq.L  01-SEP-2008  11:28:53.769   45.000  2848.0  T3000
9         URKAq.L  01-SEP-2008  11:28:53.769   45.000  1725.0  T2000
11        URKAq.L  01-SEP-2008  11:28:53.844   45.000   427.0   T500
13        URKAq.L  01-SEP-2008  11:28:53.898   45.000   450.0   T500
15        URKAq.L  01-SEP-2008  11:28:53.981   45.000   200.0   T200
20        URKAq.L  01-SEP-2008  11:28:54.124   45.000   850.0   T900
21        URKAq.L  01-SEP-2008  11:28:54.124   45.000  1073.0  T2000
24        URKAq.L  01-SEP-2008  11:28:54.329   45.000   200.0   T200
25        URKAq.L  01-SEP-2008  11:28:54.617   44.965   310.0   T400
26        URKAq.L  01-SEP-2008  11:28:54.617   44.965   310.0   T400
29        URKAq.L  01-SEP-2008  11:29:04.522   45.025   620.0   T700
30        URKAq.L  01-SEP-2008  11:29:04.769   45.025   620.0   T700
31        URKAq.L  01-SEP-2008  11:30:21.974   45.000  2800.0  T3000
32        URKAq.L  01-SEP-2008  11:30:21.974   45.000   700.0   T700
35        URKAq.L  01-SEP-2008  11:30:22.036   45.000   679.0   T700
39        URKAq.L  01-SEP-2008  11:30:22.110   45.000   200.0   T200
40        URKAq.L  01-SEP-2008  11:30:22.114   45.000   250.0   T300
42        URKAq.L  01-SEP-2008  11:30:22.405   45.025   243.0   T300
43        URKAq.L  01-SEP-2008  11:30:22.663   45.025   243.0   T300
44        URKAq.L  01-SEP-2008  11:30:23.737   45.000  2550.0  T3000
47        URKAq.L  01-SEP-2008  11:30:23.769   45.000  1500.0  T2000
51        URKAq.L  01-SEP-2008  11:30:23.769   44.920   200.0   T200
52        URKAq.L  01-SEP-2008  11:30:23.856   44.900   150.0   T200
54        URKAq.L  01-SEP-2008  11:30:25.101   45.000  1901.0  T2000
56        URKAq.L  01-SEP-2008  11:30:25.145   44.900   650.0   T700
58        URKAq.L  01-SEP-2008  11:30:25.145   44.900   200.0   T200
64        URKAq.L  01-SEP-2008  11:30:37.648   44.950   195.0   T200
65        URKAq.L  01-SEP-2008  11:30:37.829   44.950   195.0   T200
68        URKAq.L  01-SEP-2008  11:30:47.743   44.950  1031.0  T2000

I tried one of the solutions from @B.M.:

def byhand(dfs):
    mtot=0
    with open('df_all.bin','wb') as f:
        for df in dfs:
            m,n =df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ=df.values.dtype                
    #del dfs
    with open('df_all.bin','rb') as f:
        buffer=f.read()
        data=np.frombuffer(buffer,dtype=typ).reshape(mtot,n)
        df_all=pd.DataFrame(data=data,columns=list(range(n))) 
    os.remove('df_all.bin')
    return df_all

But I still got the following error:

Traceback (most recent call last):
  File "CleanDataVR2.py", line 107, in <module>
    byhand(dl)
  File "CleanDataVR2.py", line 102, in byhand
    data=np.frombuffer(buffer,dtype=typ).reshape(mtot,n)
ValueError: cannot create an OBJECT array from memory buffer


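Now I am really confused. I ran my own code on a supercomputer with 2 files and everything was fine. Why do I get errors with more files? Also, I used 5*256GB of memory, which should be enough.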
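One likely reading of the ValueError above, sketched under assumptions: each frame mixes string columns (#RIC, Date[G], Time[G], Tage) with numeric ones, so df.values falls back to dtype object. An object array stores pointers to Python objects rather than the data itself, so tobytes() writes only pointer values and np.frombuffer refuses to rebuild an object array from raw bytes. The small frame below is made-up sample data shaped like the table above, not the real files.

```python
import numpy as np
import pandas as pd

# Made-up sample rows mirroring the column layout shown in the question.
df = pd.DataFrame({
    "#RIC": ["URKAq.L", "URKAq.L"],
    "Price": [45.0, 44.965],
    "Volume": [152.0, 310.0],
    "Tage": ["T200", "T400"],
})

# Mixed string and float columns share no numeric dtype,
# so the combined 2-D array has dtype object.
mixed_dtype = df.values.dtype

# Restricting to the numeric columns gives a plain float64 array.
numeric = df[["Price", "Volume"]]
numeric_dtype = numeric.values.dtype

# A raw-bytes round trip works for float64 ...
buf = numeric.values.tobytes()
restored = np.frombuffer(buf, dtype=numeric_dtype).reshape(numeric.shape)

# ... but not for object arrays, whose bytes are only pointers.
try:
    np.frombuffer(df.values.tobytes(), dtype=mixed_dtype)
    object_roundtrip_error = None
except ValueError as exc:
    object_roundtrip_error = str(exc)

print(mixed_dtype, numeric_dtype)
print(object_roundtrip_error)
```

If this is what happens here, the byhand approach can only work on frames whose columns are all numeric and of one dtype; the string columns would have to be handled separately or encoded first.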

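@B.M. could you help me with this problem? Thanks.

Do you really need to have them all in memory at the same time? Also, you need to specify the dtype of each column to pd.read_csv; otherwise columns may be read as strings (object dtype) instead of being converted on read to a date, time, or float, which wastes gigabytes.

Please focus on improving the pd.read_csv call on just one file, to bring its memory usage down to the absolute minimum. Then post the revised code in the question. Also, it looks like Volume is an integer, not a float.

You can avoid more memory waste by not using df=df[['#RIC', 'Date[G]', 'Time[G]', 'Price', 'Volume']]. Instead, have read_csv read only the subset of columns you need (the usecols parameter), and use the dtype parameter so each column is read directly as its intended type rather than as gigabytes of unwanted strings. Please read the read_csv documentation and learn which of its options reduce memory usage. Test all of this by reading the first 1000 rows of a single file, and keep aggressively reducing the memory usage until you can't do any better.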
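The advice above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the sample file, its comma delimiter, the extra column named Extra, and the chosen dtypes are all assumptions made so the example runs on its own. In particular, reading Volume as int32 assumes the file stores it without decimals and without missing values.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical sample file standing in for one of the 245 inputs.
rows = [
    "URKAq.L,01-SEP-2008,11:{:02d}:{:02d}.000,45.000,{},x".format(
        i // 60, i % 60, 100 + i
    )
    for i in range(200)
]
sample = "#RIC,Date[G],Time[G],Price,Volume,Extra\n" + "\n".join(rows) + "\n"
path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(path, "wt") as f:
    f.write(sample)

wanted = ["#RIC", "Date[G]", "Time[G]", "Price", "Volume"]

# Baseline: default parsing of the first 1000 rows only.
baseline = pd.read_csv(path, nrows=1000)

# Tuned: read only the needed columns and give explicit dtypes.
tuned = pd.read_csv(
    path,
    nrows=1000,
    usecols=wanted,
    dtype={
        "#RIC": "category",     # few distinct tickers, many repeats
        "Date[G]": "category",  # few distinct dates per file (assumed)
        "Price": "float32",
        "Volume": "int32",      # assumes whole numbers, no missing values
    },
)

# deep=True counts the actual string storage, not just the pointers.
baseline_bytes = baseline.memory_usage(deep=True).sum()
tuned_bytes = tuned.memory_usage(deep=True).sum()
print(baseline_bytes, tuned_bytes)
```

Comparing the two memory_usage totals on a 1000-row sample gives a quick way to judge each change before running all files.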
