How to query an HDF store using Pandas/Python

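To manage the amount of RAM I consume during analysis, I have a large dataset stored in HDF5 (.h5), and I need to query it efficiently using Pandas.

The dataset contains user performance data for a suite of apps. I only want to pull a few of the 40 available fields, and then filter the resulting DataFrame down to the users who use one of the few apps I am interested in.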

import pandas as pd

# list of apps I want to analyze
apps = ['a', 'd', 'f']

# Users.h5 contains only one field_table called 'df'
store = pd.HDFStore('Users.h5')

# the following query works fine
df = store.select('df', columns=['account', 'metric1', 'metric2'], where=['Month==10', 'IsMessager==1'])

# the following pseudo-query fails
df = store.select('df', columns=['account', 'metric1', 'metric2'], where=['Month==10', 'IsMessager==1', 'app in apps'])

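I realize that the string 'app in apps' is not what I want. It is just a SQL-like representation of what I hope to achieve. I can't seem to pass a list of strings in any way, but there must be a way.

For now, I simply run the query without this condition and then filter out the apps I don't want in a subsequent step: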

df = df[df['app'].isin(apps)]

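But this is much less efficient, because every app has to be loaded into memory before I can remove the ones I don't need. In some cases this is a big problem, because I don't have enough memory to hold the entire unfiltered df.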

You are pretty close.

In [1]: df = DataFrame({'A' : ['foo','foo','bar','bar','baz'],
                        'B' : [1,2,1,2,1], 
                        'C' : np.random.randn(5) })

In [2]: df
Out[2]: 
     A  B         C
0  foo  1 -0.909708
1  foo  2  1.321838
2  bar  1  0.368994
3  bar  2 -0.058657
4  baz  1 -1.159151

[5 rows x 3 columns]

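Write the store as a table (note that in 0.12 you would use table=True rather than format='table'). Remember to specify the data_columns you want to query when creating the table (or pass data_columns=True to make every column queryable).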
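The write step is described above without its code, so here is a minimal sketch of what it might look like. The scratch path and the `key=` keyword spelling are my assumptions, the frame simply mirrors the one shown earlier, and it needs PyTables installed.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Same shape of data as the example frame above.
df = pd.DataFrame({
    "A": ["foo", "foo", "bar", "bar", "baz"],
    "B": [1, 2, 1, 2, 1],
    "C": np.random.randn(5),
})

# Assumed scratch location; any writable path works.
path = os.path.join(tempfile.mkdtemp(), "test.h5")

# format="table" writes a queryable table; data_columns=True stores every
# column as a data column so it can be used in a `where` condition.
df.to_hdf(path, key="df", mode="w", format="table", data_columns=True)

# Read everything back to confirm the round trip.
roundtrip = pd.read_hdf(path, key="df")
```

Passing a list such as data_columns=['A', 'B'] instead of True limits the queryable columns to the ones you actually filter on.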

Syntax in master/0.13: isin is accomplished via query_column=list_of_values. This is passed to where as a string.

In [8]: pd.read_hdf('test.h5','df',where='A=["foo","bar"] & B=1')
Out[8]: 
     A  B         C
0  foo  1 -0.909708
2  bar  1  0.368994

[2 rows x 3 columns]

Syntax in 0.12: where must be a list of conditions (the items in the list are ANDed together).

In [11]: pd.read_hdf('test.h5','df',where=[pd.Term('A','=',["foo","bar"]),'B=1'])
Out[11]: 
     A  B         C
0  foo  1 -0.909708
2  bar  1  0.368994

[2 rows x 3 columns]
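Applied to the question's case, the string form might look like the sketch below. The data is made up; only the column names come from the question. Building the where string from the Python list with repr() is my own choice here, not something the answer prescribes. It assumes a reasonably recent pandas with PyTables installed.

```python
import os
import tempfile

import pandas as pd

# Stand-in data: column names follow the question, values are invented.
users = pd.DataFrame({
    "account": [1, 2, 3, 4, 5, 6],
    "app": ["a", "b", "d", "f", "a", "c"],
    "Month": [10, 10, 10, 9, 10, 10],
    "IsMessager": [1, 1, 1, 1, 0, 1],
    "metric1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# Assumed scratch location for the example store.
path = os.path.join(tempfile.mkdtemp(), "Users.h5")

# Only the columns used in the filter need to be data columns.
users.to_hdf(path, key="df", mode="w", format="table",
             data_columns=["app", "Month", "IsMessager"])

apps = ["a", "d", "f"]

# Embed the list literal in the condition: (app == ['a', 'd', 'f']) & ...
where = f"(app == {apps!r}) & (Month == 10) & (IsMessager == 1)"

# The filter is applied while reading, so unwanted apps are never loaded.
with pd.HDFStore(path, mode="r") as store:
    result = store.select("df", where=where, columns=["account", "metric1"])
```

With this sample data only accounts 1 and 3 satisfy all three conditions, so result has two rows and the two requested columns.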