Python: filtering a PyTables table on import into pandas


I have a dataset created with PyTables that I am trying to import into a pandas DataFrame. I can't apply a where filter in the read_hdf step. I'm on pandas '0.12.0'.

My sample PyTables data:

import tables
import pandas as pd
import numpy as np

class BranchFlow(tables.IsDescription):
    branch = tables.StringCol(itemsize=25, dflt=' ')
    flow = tables.Float32Col(dflt=0)

filters = tables.Filters(complevel=8)
h5 = tables.openFile('foo.h5', 'w')
tbl = h5.createTable('/', 'BranchFlows', BranchFlow, 
            'Branch Flows', filters=filters, expectedrows=50e6) 

for i in range(25):
    element = tbl.row
    element['branch'] = str(i)
    element['flow'] = np.random.randn()
    element.append()
tbl.flush()
h5.close()

I can import it into a DataFrame without any problem:

store = pd.HDFStore('foo.h5')
print store
print pd.read_hdf('foo.h5', 'BranchFlows').head()

This shows:

In [10]: print store
<class 'pandas.io.pytables.HDFStore'>
File path: foo.h5
/BranchFlows            frame_table [0.0.0] (typ->generic,nrows->25,ncols->2,indexers->[index],dc->[branch,flow])

In [11]: print pd.read_hdf('foo.h5', 'BranchFlows').head()
  branch      flow
0      0 -0.928300
1      1 -0.256454
2      2 -0.945901
3      3  1.090994
4      4  0.350750


But I can't get a filter to work on the flow column:

pd.read_hdf('foo.h5', 'BranchFlows', where=['flow>0.5'])

<snip traceback>

TypeError: passing a filterable condition to a non-table indexer [field->flow,op->>,value->[0.5]]


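Reading from a table created directly by PyTables only lets you read the (entire) table. To use the pandas selection mechanism, you have to write the data with the pandas tools, in Table format, because the metadata pandas needs is not present in a raw PyTables table (supporting it is possible, but would take some work).

So read in the table as you did above, then create a new one, specifying the table format.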
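The round trip described above can be sketched as follows. This is a minimal illustration under assumptions, not the answer's exact session: the DataFrame is an in-memory stand-in for the result of pd.read_hdf('foo.h5', 'BranchFlows') with made-up flow values, the file name is hypothetical, and format="table" is used as the newer spelling of the table=True flag from pandas 0.12.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the frame returned by pd.read_hdf('foo.h5', 'BranchFlows');
# the flow values are made up so the result is easy to check.
df = pd.DataFrame({
    "branch": [str(i) for i in range(25)],
    "flow": [float(i) - 12.0 for i in range(25)],
})

# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "foo_copy.h5")

# Write in pandas' table format. data_columns=True makes every column
# queryable; format="table" is assumed to be available (newer pandas).
df.to_hdf(path, key="BranchFlowsTable", format="table", data_columns=True)

# The where clause is now applied while reading from disk.
filtered = pd.read_hdf(path, key="BranchFlowsTable", where="flow > 0.5")
print(filtered)
```

Only the rewritten key supports the where clause; the original PyTables-created node still has to be read in full.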


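I can't remember why I didn't create a pandas DataFrame to begin with. Now numpy throws a ValueError: array is too big. when I try to read in the actual dataset, but that is a separate problem.

You can do it in parts (pass append=True): you can pass partial frames and build up the table.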

In [6]: df.to_hdf('foo.h5','BranchFlowsTable',data_columns=True,table=True)

In [24]: with pd.get_store('foo.h5') as store:
    print(store)
   ....:     
<class 'pandas.io.pytables.HDFStore'>
File path: foo.h5
/BranchFlows                 frame_table [0.0.0] (typ->generic,nrows->25,ncols->2,indexers->[index],dc->[branch,flow])
/BranchFlowsTable            frame_table  (typ->appendable,nrows->25,ncols->2,indexers->[index],dc->[branch,flow])    

In [7]: pd.read_hdf('foo.h5','BranchFlowsTable',where='flow>0.5')
Out[7]: 

   branch      flow
14     14  1.503739
15     15  0.660297
17     17  0.685152
18     18  1.156073
20     20  0.994792
21     21  1.266463
23     23  0.927678
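For the "array is too big" case, the suggestion to build the table in parts might look like the sketch below. The thread does not show how the partial frames are obtained from the original file, so iter_chunks is a hypothetical stand-in generator and the file name is made up. The sketch uses HDFStore.append, which always writes table format, and passes min_itemsize so that later chunks with longer branch strings still fit (25 mirrors the StringCol itemsize in the question).

```python
import os
import tempfile

import pandas as pd


def iter_chunks(n_rows, chunk_size):
    """Yield partial frames; a stand-in for reading the real data in pieces."""
    for start in range(0, n_rows, chunk_size):
        stop = min(start + chunk_size, n_rows)
        yield pd.DataFrame(
            {
                "branch": [str(i) for i in range(start, stop)],
                "flow": [float(i) - 12.0 for i in range(start, stop)],
            },
            index=range(start, stop),
        )


# Hypothetical output file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "chunked.h5")

with pd.HDFStore(path, mode="w") as store:
    for chunk in iter_chunks(n_rows=25, chunk_size=10):
        # Each call appends rows to the same on-disk table.
        store.append(
            "BranchFlowsTable",
            chunk,
            data_columns=True,
            min_itemsize={"branch": 25},
        )

# The full table is never held in memory while writing; queries work as before.
total_rows = len(pd.read_hdf(path, key="BranchFlowsTable"))
result = pd.read_hdf(path, key="BranchFlowsTable", where="flow > 0.5")
print(total_rows, len(result))
```

Without min_itemsize, the string column width would be fixed by the first chunk, and appending a chunk with longer strings would fail.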