Python: Selecting rows from an HDFStore given a list of indexes


python pandas


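I have a large dataset that does not fit in memory. I currently store it in an HDFStore containing two tables:

df_hist is a collection of histograms indexed by a MultiIndex, where the first level labels the histogram and the second level labels the histogram bin.

df_params contains the simulation parameters used to generate each histogram, and is indexed by the histogram label (i.e. the first level of the df_hist index).

What I would like to do is use a query on the df_params table to select a subset of histograms, and then load only the relevant rows from df_hist. If the data fit in memory, I would do it like this: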

df_params = store['df_params']
df_hist = store['df_hist']
selection = df_params.index[df_params['N']==64]
df = df_hist[df_hist.index.get_level_values('id').isin(selection)]

What is the best way to achieve this when df_hist is too large to fit in memory? Ideally, it would be possible to do something like

store.select('df_hist', where='id isin selection')

This works in 0.12:

In [15]: pd.read_hdf('hist.hdf','df',where=pd.Term('l1','=',selection.index.tolist()))
Out[15]: 
           data
l1 l2          
2  0   1.397368
   1   0.198522
   2   1.034036
   3   0.650406
   4   1.823683
3  0   0.045635
   1  -0.213975
   2  -1.221950
   3  -0.145615
   4  -1.187883
4  0  -0.782221
   1  -0.626280
   2  -0.331885
   3  -0.975978
   4   2.006322

This also works in master/0.13:

In [16]: pd.read_hdf('hist.hdf','df',where='l1=selection.index')
Out[16]: 
           data
l1 l2          
2  0   1.397368
   1   0.198522
   2   1.034036
   3   0.650406
   4   1.823683
3  0   0.045635
   1  -0.213975
   2  -1.221950
   3  -0.145615
   4  -1.187883
4  0  -0.782221
   1  -0.626280
   2  -0.331885
   3  -0.975978
   4   2.006322

As of 0.19, I was able to use the technique the OP wanted:

indices = [3,5]
for df in store.select('df', where="index in indices", chunksize=100000):
    print(df)  # prints rows with index 3 or 5

I would have posted this as a comment, but I do not have enough reputation to do so.
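The two-step workflow from the question (query df_params, then load only the matching rows of df_hist) can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the poster's actual setup: the toy data, the file name hist_demo.h5, the column name value, and the level name bin are all made up for the example, and it assumes a reasonably recent pandas with the PyTables package installed. Both tables are written in table format, and N is declared as a data column so that it can be used in a where clause.

```python
import os
import tempfile

import pandas as pd  # HDF5 I/O also requires the PyTables package

# Hypothetical toy data standing in for the real simulation output.
ids = [1, 2, 3, 4]
df_params = pd.DataFrame({"N": [32, 64, 64, 128]}, index=pd.Index(ids, name="id"))
hist_index = pd.MultiIndex.from_product([ids, range(5)], names=["id", "bin"])
df_hist = pd.DataFrame({"value": list(range(len(hist_index)))}, index=hist_index)

path = os.path.join(tempfile.mkdtemp(), "hist_demo.h5")  # assumed file name
with pd.HDFStore(path, mode="w") as store:
    # Table format is required for where clauses; N must be a data column
    # to be queryable. MultiIndex levels are queryable by their names.
    store.put("df_params", df_params, format="table", data_columns=["N"])
    store.put("df_hist", df_hist, format="table")

with pd.HDFStore(path, mode="r") as store:
    # Step 1: run the parameter query on disk and keep only the labels.
    selection = store.select("df_params", where="N == 64").index.tolist()

    # Step 2: load only the matching histograms, a chunk at a time.
    pieces = [
        chunk
        for chunk in store.select("df_hist", where="id in selection", chunksize=8)
    ]

df = pd.concat(pieces)
print(df)
```

In a real workload each chunk would usually be processed inside the loop rather than concatenated, since the point of chunksize is to avoid holding the whole result in memory.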

What have you tried? Have you looked at the documentation? From the examples there:

store.select('dfq', "index>Timestamp('20130104') & columns=['A', 'B']")

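That sounds a lot like what you want. You may need to write a wrapper to do the selection, since I am not sure whether isin is supported. Also, the data needs to be stored in table format, not fixed format.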

Passing a query of the form lhs = list_of_values is equivalent to isin.

Thanks, that is exactly what I needed.
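The two points made in the comments above (the store must use table format, and comparing a column to a list behaves like isin) can be checked with a small experiment. This is only a sketch under assumptions: the frame, the key names fixed_df and table_df, and the file name format_demo.h5 are hypothetical, and it assumes a recent pandas with PyTables installed. The exact exception type raised for a fixed-format node may differ between versions, so the code catches both TypeError and ValueError.

```python
import os
import tempfile

import pandas as pd  # HDF5 I/O also requires the PyTables package

# Hypothetical frame and list of index labels to select.
df = pd.DataFrame({"x": [10, 20, 30, 40]}, index=pd.Index([1, 2, 3, 4], name="id"))
wanted = [2, 4]

path = os.path.join(tempfile.mkdtemp(), "format_demo.h5")  # assumed file name
with pd.HDFStore(path, mode="w") as store:
    store.put("fixed_df", df, format="fixed")
    store.put("table_df", df, format="table")

with pd.HDFStore(path, mode="r") as store:
    # A fixed-format node can only be read in its entirety,
    # so a where clause is expected to be rejected.
    try:
        store.select("fixed_df", where="index == wanted")
        fixed_is_queryable = True
    except (TypeError, ValueError):
        fixed_is_queryable = False

    # On a table-format node, comparing the index to a list acts like isin.
    on_disk = store.select("table_df", where="index == wanted")

# The in-memory equivalent, for comparison.
in_memory = df[df.index.isin(wanted)]

print(fixed_is_queryable)
print(on_disk)
```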
