How to import a gzip file larger than the RAM limit into a pandas DataFrame? "Kill 9" Use HDF5?


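I have a gzip file of about 90 GB. It fits comfortably on disk, but it is far larger than RAM.

How can I import it into a pandas DataFrame? I tried the following from the command line: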

# start with Python 3.4.5
import pandas as pd
filename = 'filename.gzip'   # size 90 GB
df = pd.read_table(filename, compression='gzip')

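However, after a few minutes Python shuts down with the message "Kill 9".

Once the DataFrame object df is defined, I plan to save it to HDF5.

What is the correct way to do this? How can I use pandas.read_table() for it?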

I would do it this way, reading the file in chunks and appending each chunk to an HDF5 store:

import pandas as pd

filename = 'filename.gzip'      # size 90 GB
hdf_fn = 'result.h5'
hdf_key = 'my_huge_df'
cols = ['colA','colB','colC','colZ'] # put here a list of all your columns
cols_to_index = ['colA','colZ'] # put here the list of YOUR columns, that you want to index
chunksize = 10**6               # you may want to adjust it ... 

store = pd.HDFStore(hdf_fn)

for chunk in pd.read_table(filename, compression='gzip', header=None, names=cols, chunksize=chunksize):
    # don't index data columns in each iteration - we'll do it later
    store.append(hdf_key, chunk, data_columns=cols_to_index, index=False)

# index data columns in HDFStore
store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')
store.close()
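
Once the store is built, the indexed data columns can be used to pull out only the rows you need instead of loading everything back into RAM. Here is a minimal sketch of that read-back step, assuming PyTables is installed; the ten-row DataFrame, the temporary file path, and the query condition are made up for illustration and stand in for the real 90 GB data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical small stand-in for the large store built above.
tmp_dir = tempfile.mkdtemp()
hdf_fn = os.path.join(tmp_dir, 'result.h5')
hdf_key = 'my_huge_df'
cols_to_index = ['colA', 'colZ']

df = pd.DataFrame({
    'colA': np.arange(10),
    'colB': np.arange(10) * 2.0,
    'colZ': np.arange(10) % 3,
})

# Same pattern as above: append without indexing, then index once at the end.
with pd.HDFStore(hdf_fn, mode='w') as store:
    store.append(hdf_key, df, data_columns=cols_to_index, index=False)
    store.create_table_index(hdf_key, columns=cols_to_index, optlevel=9, kind='full')

with pd.HDFStore(hdf_fn, mode='r') as store:
    # Only columns listed in data_columns can appear in a where clause.
    subset = store.select(hdf_key, where='(colA >= 7) & (colZ == 1)')
    # Row count of the stored table, read without loading the data.
    nrows = store.get_storer(hdf_key).nrows

print(subset)
print(nrows)
```

For a result that is itself too large for RAM, store.select() also accepts a chunksize argument and then returns an iterator of DataFrames.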

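Thank you! So you would adjust the chunksize parameter based on whether the script crashes (as above)?

@JianguoHisiang, yes, you can make an educated guess... For example, if your server has 32 GB of RAM and a 1M (10**6) row DF needs 1 GB, you can increase it to 20M (2*10**7) and test it, checking whether it gives you a speed benefit...

The input file filename.gzip has no header. cols_to_index refers to columns that must already be labeled in the DataFrame, correct? To do this with a headerless gzip file, would one need to index the data columns in each iteration at pd.read_table() above? That might be inefficient...

@JianguoHisiang, I've updated my answer - please check

@JianguoHisiang, please open a new question and describe the problem with a small reproducible sample data set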
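
To make the "educated guess" about chunksize a bit more concrete, one option is to read a small sample, measure how much memory each row takes, and divide a memory budget by that figure. The sketch below is only an illustration under stated assumptions: the tiny gzip file, the suggest_chunksize helper, and the budget values are all hypothetical, and memory during parsing is likely higher than the size of the finished DataFrame, so leave headroom. It also shows that header=None together with names=cols labels the columns of every chunk of a headerless file, so no per-iteration indexing is needed for that.

```python
import gzip
import os
import tempfile

import pandas as pd

# Hypothetical tiny headerless, tab-separated gzip file standing in for the 90 GB input.
tmp_dir = tempfile.mkdtemp()
filename = os.path.join(tmp_dir, 'sample.gzip')
with gzip.open(filename, 'wt') as fh:
    for i in range(1000):
        fh.write('{}\t{}\t{}\t{}\n'.format(i, i * 0.5, 'row{}'.format(i), i % 7))

cols = ['colA', 'colB', 'colC', 'colZ']

# Read only the first rows to measure the in-memory size per row.
sample = pd.read_table(filename, compression='gzip', header=None, names=cols, nrows=500)
bytes_per_row = sample.memory_usage(deep=True).sum() / len(sample)


def suggest_chunksize(bytes_per_row, budget_bytes):
    """Hypothetical helper: number of rows that fit into budget_bytes."""
    return max(1, int(budget_bytes // bytes_per_row))


# Deliberately tiny budget so the small demo file is split into several chunks.
chunksize = suggest_chunksize(bytes_per_row, budget_bytes=16 * 1024)

total = 0
for chunk in pd.read_table(filename, compression='gzip', header=None,
                           names=cols, chunksize=chunksize):
    # Every chunk carries the labels from names=cols.
    assert list(chunk.columns) == cols
    total += len(chunk)

print(bytes_per_row, chunksize, total)
```

With the numbers from the comment above (about 1 GB per 1M rows, so roughly 1000 bytes per row), a 20 GB budget gives suggest_chunksize(1000, 20 * 10**9), which is 2*10**7 rows.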