How to concatenate multiple pandas.DataFrames without running into MemoryError


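I have three DataFrames that I'm trying to concatenate.

concat_df = pd.concat([df1, df2, df3])

This results in a MemoryError. How can I resolve this?

Note that most of the existing similar questions are about memory errors that occur when reading large files. I don't have that problem. I have already read my files into DataFrames. I just can't concatenate that data.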

I advise you to put your DataFrames into a single csv file by appending them, and then read that csv file.

Execute:

# write df1 content to file.csv (this also writes the header row)
df1.to_csv('file.csv', index=False)
# append df2 content to file.csv (header=False avoids repeating the header row)
df2.to_csv('file.csv', mode='a', header=False, index=False)
# append df3 content to file.csv
df3.to_csv('file.csv', mode='a', header=False, index=False)

# free memory
del df1, df2, df3

# read all df1, df2, df3 contents
df = pd.read_csv('file.csv')

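If this solution is not fast enough, for example when the files are larger than usual, do:

df1.to_csv('file.csv', index=False)
# write the parts to be appended without a header row
df2.to_csv('file1.csv', header=False, index=False)
df3.to_csv('file2.csv', header=False, index=False)

del df1, df2, df3

Then run the bash commands:

cat file1.csv >> file.csv
cat file2.csv >> file.csv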

Or concatenate the csv files in Python:

def concat(file1, file2):
    # read all of file2, then append it to the end of file1
    with open(file2, 'r') as f2:
        data = f2.read()
    with open(file1, 'a') as f1:
        f1.write(data)

concat('file.csv', 'file1.csv')
concat('file.csv', 'file2.csv')

Then read:

df = pd.read_csv('file.csv')
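
The concat helper above loads the whole second file into memory before writing it out. A minimal sketch of a streaming variant, which copies the file in fixed-size chunks instead; the name append_file and the 1 MB chunk size are assumptions made for illustration, and the appended file is assumed to have been written without a header row:

```python
import shutil


def append_file(dest_path, src_path, chunk_size=1024 * 1024):
    """Append the bytes of src_path to the end of dest_path.

    Only chunk_size bytes are held in memory at any time.
    """
    with open(src_path, 'rb') as src, open(dest_path, 'ab') as dest:
        shutil.copyfileobj(src, dest, chunk_size)
```

It would be called the same way as concat, for example append_file('file.csv', 'file1.csv').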

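Similar to what @glegoux suggests, pd.DataFrame.to_csv can also write in append mode, so you can do something like:

# filename is a placeholder for the path of the combined csv file
df1.to_csv(filename, index=False)
# header=False avoids writing the column names again
df2.to_csv(filename, mode='a', header=False, index=False)
df3.to_csv(filename, mode='a', header=False, index=False)

del df1, df2, df3
df_concat = pd.read_csv(filename)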

A bit of a guess here, but maybe:

df1 = pd.concat([df1,df2])
del df2
df1 = pd.concat([df1,df3])
del df3

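Obviously you could do that as a loop, but the key is that you want to delete df2, df3, etc. as you go. The way you are doing it in the question, you never clear out the old DataFrames, so you are using about twice as much memory as you need to.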
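
A minimal sketch of that loop, assuming the DataFrames are held in a list and nothing else keeps a reference to them; the helper name concat_and_release is hypothetical:

```python
import pandas as pd


def concat_and_release(frames):
    """Concatenate the DataFrames in `frames` one at a time.

    The list is emptied as the loop runs, so each input can be garbage
    collected as soon as it has been merged into the result.
    """
    if not frames:
        raise ValueError("frames must contain at least one DataFrame")
    result = frames.pop(0)
    while frames:
        nxt = frames.pop(0)  # drop the list's reference to this frame
        result = pd.concat([result, nxt])
        del nxt              # drop the local reference before the next step
    return result
```

Memory is only released if the caller does not hold the inputs under other names, so something like concat_df = concat_and_release([df1, df2, df3]) should be followed by del df1, df2, df3.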

More generally, if you are reading and concatenating, I'd do it something like this (if you had 3 CSVs: foo0, foo1, foo2):
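
A minimal sketch of such a read-and-concatenate loop, assuming the three files are named foo0.csv, foo1.csv and foo2.csv; the helper name read_and_concat is hypothetical:

```python
import pandas as pd


def read_and_concat(paths):
    """Read each csv in turn and fold it into one combined DataFrame."""
    concat_df = None
    for path in paths:
        temp_df = pd.read_csv(path)
        # temp_df is rebound on the next iteration, so only one small
        # DataFrame is alive alongside the combined one at any time
        if concat_df is None:
            concat_df = temp_df
        else:
            concat_df = pd.concat([concat_df, temp_df])
    return concat_df


# usage, assuming the files exist in the working directory:
# concat_df = read_and_concat(['foo%d.csv' % i for i in range(3)])
```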

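In other words, as you are reading in files, you only keep the small DataFrames in memory temporarily, until you concatenate them into the combined df, concat_df. As you currently do it, you are keeping around all the smaller DataFrames, even after concatenating them.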

Another option:

1) Write df1 to a .csv file:

df1.to_csv('Big File.csv')

2) Open the .csv file and append df2:

with open('Big File.csv', 'a') as f:
    df2.to_csv(f, header=False)

3) Repeat step 2 with df3:

with open('Big File.csv', 'a') as f:
    df3.to_csv(f, header=False)

You can store the individual DataFrames in an HDF store, and then query the store just like one big DataFrame.

import os
import pandas as pd

# name of store
fname = 'my_store'

# pd.HDFStore is used here because pd.get_store was removed in pandas 1.0
with pd.HDFStore(fname) as store:

    # save individual dfs to store
    for df in [df1, df2, df3, df_foo]:
        store.append('df', df, data_columns=['FOO', 'BAR', 'ETC'])  # data_columns = identify the columns in the dfs you are appending

    # access the store as a single df
    df = store.select('df', where=['A>2'])  # change where condition as required (see documentation for examples)
    # Do other stuff with df #

# the store is closed on leaving the with block; delete the file when you're done
os.remove(fname)

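Dask might be a good option to try for handling large DataFrames; go through the Dask docs.

I've had similar performance issues while trying to concatenate a large number of DataFrames to a "growing" DataFrame.

My workaround was to append all the sub-DataFrames to a list, and then concatenate the list of DataFrames once processing of the sub-DataFrames was complete. This brought the running time down to almost half.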
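
A minimal sketch of that workaround. The function name build_frame and the dummy per-part data are assumptions that stand in for whatever processing produces each sub-DataFrame:

```python
import pandas as pd


def build_frame(n_parts):
    """Collect sub-DataFrames in a list, then concatenate exactly once."""
    parts = []
    for i in range(n_parts):
        # placeholder for the real per-part processing
        part = pd.DataFrame({'part': [i] * 3, 'value': range(3)})
        parts.append(part)
    # a single concat copies each part once, instead of re-copying the
    # growing result on every iteration
    return pd.concat(parts, ignore_index=True)
```

This mainly helps running time. Peak memory is still roughly the parts plus the combined result, so it does not by itself avoid a MemoryError.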

As seen in the other answers, this is a memory problem. The solution is to store the data on disk, then build a single DataFrame.

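With such huge data, performance is an issue.

The csv solutions are very slow, since conversion happens in text mode. The HDF5 solution is shorter, more elegant and faster, since it uses binary mode. I propose a third way in binary mode, using pickle, which seems to be even faster, but is more technical and needs more room. And a fourth, by hand.

Here is the code: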

import os
import pickle

import numpy as np
import pandas as pd

# a DataFrame factory:
dfs=[]
for i in range(10):
    dfs.append(pd.DataFrame(np.empty((10**5,4)),columns=range(4)))

# a csv solution
def bycsv(dfs):
    # first frame: write mode with header; later frames: append without header
    md,hd='w',True
    for df in dfs:
        df.to_csv('df_all.csv',mode=md,header=hd,index=None)
        md,hd='a',False
    #del dfs
    df_all=pd.read_csv('df_all.csv',index_col=None)
    os.remove('df_all.csv')
    return df_all

Better solutions:

def byHDF(dfs):
    store=pd.HDFStore('df_all.h5')
    for df in dfs:
        store.append('df',df,data_columns=list('0123'))
    #del dfs
    df=store.select('df')
    store.close()
    os.remove('df_all.h5')
    return df

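def bypickle(dfs):
    c=[]
    # dump every frame into one pickle file, remembering each row count
    with open('df_all.pkl','wb') as f:
        for df in dfs:
            pickle.dump(df,f)
            c.append(len(df))
    #del dfs
    with open('df_all.pkl','rb') as f:
        df_all=pickle.load(f)
        offset=len(df_all)
        # preallocate room for the remaining frames
        # (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
        df_all=pd.concat([df_all,pd.DataFrame(np.empty(sum(c[1:])*4).reshape(-1,4))])

        # load the remaining frames one by one and copy them into place
        for size in c[1:]:
            df=pickle.load(f)
            df_all.iloc[offset:offset+size]=df.values
            offset+=size
    os.remove('df_all.pkl')
    return df_all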

For homogeneous DataFrames, we can do even better:

def byhand(dfs):
    mtot=0
    with open('df_all.bin','wb') as f:
        for df in dfs:
            m,n =df.shape
            mtot += m
            f.write(df.values.tobytes())
            typ=df.values.dtype                
    #del dfs
    with open('df_all.bin','rb') as f:
        buffer=f.read()
        data=np.frombuffer(buffer,dtype=typ).reshape(mtot,n)
        df_all=pd.DataFrame(data=data,columns=list(range(n))) 
    os.remove('df_all.bin')
    return df_all

And some tests on (small, 32 MB) data to compare performance. You have to multiply by about 128 for 4 GB.

In [92]: %time w=bycsv(dfs)
Wall time: 8.06 s

In [93]: %time x=byHDF(dfs)
Wall time: 547 ms

In [94]: %time v=bypickle(dfs)
Wall time: 219 ms

In [95]: %time y=byhand(dfs)
Wall time: 109 ms

A check:

In [195]: (x.values==w.values).all()
Out[195]: True

In [196]: (x.values==v.values).all()
Out[196]: True

In [197]: (x.values==y.values).all()
Out[197]: True

Of course all of this must be improved and tuned to fit your problem.

For example, df3 can be split into chunks of size "total_memory_size - df_total_size" to be able to run bypickle.
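
A minimal sketch of such a split, assuming the chunk size is expressed as a number of rows that the caller derives from the memory still available; the helper name iter_row_chunks is hypothetical:

```python
import pandas as pd


def iter_row_chunks(df, chunk_rows):
    """Yield consecutive row slices of df with at most chunk_rows rows each."""
    if chunk_rows <= 0:
        raise ValueError("chunk_rows must be positive")
    for start in range(0, len(df), chunk_rows):
        yield df.iloc[start:start + chunk_rows]
```

Each chunk could then be passed to pickle.dump in place of the whole DataFrame, for example for chunk in iter_row_chunks(df3, 100000): pickle.dump(chunk, f).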


I can edit this if you give more information on your data structure and size. Beautiful question!
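I thank the community for their answers. However, in my case, I found that the problem was actually due to the fact that I was using 32-bit Python.
There are memory limits defined for Windows 32-bit and 64-bit operating systems. For a 32-bit process, the limit is only 2 GB. So even if you have more than 2 GB of RAM, and even if you are running a 64-bit OS, a 32-bit process is limited to just 2 GB of RAM. In my case that process was Python.
I upgraded to 64-bit Python and haven't had a memory error since.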
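
A quick way to check which kind of interpreter is running, as a minimal sketch using only the standard library; the function name python_bits is hypothetical:

```python
import struct
import sys


def python_bits():
    """Return the pointer width of the running interpreter in bits."""
    return struct.calcsize('P') * 8


# on a 64-bit interpreter sys.maxsize is larger than 2**32
print(python_bits(), sys.maxsize > 2**32)
```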
When writing to the hard disk, df.to_csv throws an error for columns=False.
The following solution works fine:
# write train1 to hard disk as file.csv
train1.to_csv('file.csv', index=False)
# append train2 to file.csv
train2.to_csv('file.csv', mode='a', header=False, index=False)
# read the appended csv as df
train = pd.read_csv('file.csv')

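Are these time series? Do you want to concat them on the date?

I want to concat on index. It's not a time series.

Did you add a bounty because you don't want to write to file?

@IanS Just to draw more attention and see whether writing csv is the only option, or whether there is a more elegant solution.

Well, my only other idea is to do as John suggested in his answer...

But if we want to concatenate along columns, i.e. axis=1, then your answer won't work!

Does not work for large files or memory errors.

Currently header=False should be used instead of columns=False.

@abhilashawashi After dumping the files to disk, the paste command might be a better option.

The last option doesn't work: "AttributeError: 'str' object has no attribute 'read'"

I tried to use the byhand solution, but got an error: cannot create an object array from memory buffer. I'm not sure whether it can be fixed in Python 3.

Great post! What would you need to do to concatenate these column-wise?

Very nice comparison, very useful, thanks!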