Python: Fast concatenation of a large number of homogeneous DataFrames

python pandas

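I have about 7000 homogeneous DataFrames (same columns, but different sizes) and want to combine them into one big DataFrame for further analysis.

If I generate all the tables and store them in a list, memory explodes, so I cannot use pandas.concat([...all my tables...]). Instead I chose to do the following: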

import pandas

big_table = None
for table in readTables():
    # pandas.concat silently drops None entries, so the first pass works too
    big_table = pandas.concat([big_table, table], ignore_index=True)

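I would like to know how the efficiency of the for loop approach compares with the pandas.concat([...all tables...]) approach. Are they equally fast?

Since the tables are homogeneous and the index does not matter, is there any trick to speed up the concatenation?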
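To make the comparison in the question concrete, here is a minimal sketch of both approaches on small synthetic data. The generator `read_tables` and its parameters are hypothetical stand-ins for the asker's `readTables()`, and the sizes are toy values. Each `pandas.concat` call builds a new frame, so the loop re-copies everything accumulated so far on every pass, while the single call copies each table once but needs all tables in memory together.

```python
import numpy as np
import pandas as pd


def read_tables(n_tables=50, n_rows=20, n_cols=5):
    """Hypothetical stand-in for readTables(): yields small homogeneous frames."""
    columns = ['col{}'.format(i) for i in range(n_cols)]
    rng = np.random.default_rng(0)  # fixed seed so both runs see the same data
    for _ in range(n_tables):
        yield pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)


def concat_in_loop(tables):
    # Each pass copies the accumulated frame plus the new table into a new
    # frame, so total copying grows roughly quadratically with table count.
    big_table = None
    for table in tables:
        # None entries are dropped silently by pandas.concat
        big_table = pd.concat([big_table, table], ignore_index=True)
    return big_table


def concat_once(tables):
    # One call copies each table a single time, but every table has to be
    # alive in memory at the same moment.
    return pd.concat(list(tables), ignore_index=True)


looped = concat_in_loop(read_tables())
single = concat_once(read_tables())
print(looped.shape, single.shape)
print(looped.equals(single))
```

Both functions should produce the same frame; the difference is in how much copying and peak memory each one needs, which is worth timing on the real data before choosing.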

Here is an example that uses pd.HDFStore to append many tables together.

import pandas as pd
import numpy as np
from time import time

# your tables
# =========================================
columns = ['col{}'.format(i) for i in range(100)]
data = np.random.randn(100000).reshape(1000, 100)
df = pd.DataFrame(data, columns=columns)

# many tables, generator
def get_generator(df, n=1000):
    for x in range(n):
        yield df

table_reader = get_generator(df, n=1000)


# processing
# =========================================
# create a hdf5 storage, compression level 5, (1-9, 9 is extreme)
h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5', complevel=5, complib='blosc')

Out[2]: 
<class 'pandas.io.pytables.HDFStore'>
File path: /home/Jian/Downloads/my_hdf5_file.h5
Empty


t0 = time()

# loop over your df
counter = 1
for frame in table_reader:
    print('Appending Table {}'.format(counter))
    h5_file.append('big_table', frame, complevel=5, complib='blosc')
    counter += 1

t1 = time()

# Appending Table 1
# Appending Table 2
# ...
# Appending Table 999
# Appending Table 1000


print(t1-t0)

Out[3]: 41.6630880833

# check our hdf5_file
h5_file

Out[7]: 
<class 'pandas.io.pytables.HDFStore'>
File path: /home/Jian/Downloads/my_hdf5_file.h5
/big_table            frame_table  (typ->appendable,nrows->1000000,ncols->100,indexers->[index])

# close hdf5
h5_file.close()

# very fast to retrieve your data in any future IPython session

h5_file = pd.HDFStore('/home/Jian/Downloads/my_hdf5_file.h5')

%time my_big_table = h5_file['big_table']

CPU times: user 217 ms, sys: 1.11 s, total: 1.33 s
Wall time: 1.89 s

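If they are all homogeneous, numpy's hstack or vstack might be a good option. But I'm not sure how much it would help.

It looks like you just want to append all the tables together (vertically). If so, you may want to consider append in pd.HDFStore.

Thanks a lot @Jianxun Li. Your idea reminds me of real big-data approaches like mapreduce that rely on hard disk storage. Very interesting!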
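The numpy suggestion in the first comment could look roughly like the sketch below. It assumes every column shares one numeric dtype, and `read_tables` is again a hypothetical table source. Note that the list of arrays still has to fit in memory, so this does not by itself solve the memory problem described in the question.

```python
import numpy as np
import pandas as pd


def read_tables(n_tables=10, n_cols=4):
    """Hypothetical table source: homogeneous float frames of varying length."""
    columns = ['col{}'.format(i) for i in range(n_cols)]
    rng = np.random.default_rng(0)
    for i in range(n_tables):
        n_rows = 5 + i  # sizes differ, columns do not
        yield pd.DataFrame(rng.standard_normal((n_rows, n_cols)), columns=columns)


columns = None
blocks = []
for table in read_tables():
    columns = table.columns          # identical for every table
    blocks.append(table.to_numpy())  # keep only the raw values, drop the index

stacked = np.vstack(blocks)          # one vertical stack of all blocks
big_table = pd.DataFrame(stacked, columns=columns)
print(big_table.shape)
```

Whether this beats a single pandas.concat call depends on the data; as the commenter says, the gain may be small.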