Python: on-disk indexing of a MultiIndex HDFStore


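To improve performance and reduce the memory footprint, I am trying to read a MultiIndex HDFStore created in Pandas. The original store is fairly large, but the problem can be reproduced with a similar, smaller example: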

import pandas as pd

df = pd.DataFrame([0.25, 0.5, 0.75, 1.0],
                  index=['Item0', 'Item1', 'Item2', 'Item3'], columns=['Values'])

# Stack the frame twice; keys= adds the outer level, names= labels both levels
df = pd.concat((df.iloc[:], df.iloc[:]), axis=0, names=['Item', 'N'],
               keys=['Items0', 'Items1'])

df.to_hdf('hdfs.h5', 'df', format='table', mode='w', complevel=9, complib='blosc', data_columns=True)

store = pd.HDFStore('hdfs.h5', mode='r')

store.select('df', where='Item="Items0"')

I expected this to return the values for that sub-index, but it raises an error instead:

> ValueError: The passed where expression: Item="Items0"
>             contains an invalid variable reference
>             all of the variable refrences must be a reference to
>             an axis (e.g. 'index' or 'columns'), or a data_column
>             The currently defined references are: index,iron,columns

The index is as follows:

store['df'].index

> MultiIndex(levels=[['Items0', 'Items1'], ['Item0', 'Item1', 'Item2',
> 'Item3']],
>            labels=[[0, 0, 0, 0, 1, 1, 1, 1], [0, 1, 2, 3, 0, 1, 2, 3]],
>            names=['Item', 'N'])

Can anyone explain why this happens, or how to do this correctly…
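The names that a where= expression can refer to are the index level names, so it helps to confirm what `pd.concat` actually produced. The following is a minimal in-memory sketch, not part of the original thread; the variable names `base`, `level_names` and `subset` are illustrative only, and the boolean mask is just an in-memory counterpart of the on-disk query.

```python
import pandas as pd

base = pd.DataFrame([0.25, 0.5, 0.75, 1.0],
                    index=['Item0', 'Item1', 'Item2', 'Item3'],
                    columns=['Values'])

# keys= adds a new outer level; names= labels the outer and inner levels.
df = pd.concat((base, base), axis=0, names=['Item', 'N'],
               keys=['Items0', 'Items1'])

# These are the level names a where= expression would have to use.
level_names = list(df.index.names)

# In-memory counterpart of where='Item="Items0"' (illustration only).
subset = df[df.index.get_level_values('Item') == 'Items0']

print(level_names)
print(subset)
```

Unlike `store.select`, this loads the whole frame into memory first, which is exactly what the question is trying to avoid for a large store.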

It works if you remove data_columns=True:

df.to_hdf('hdfs3.h5', 'df', format='table',mode='w',complevel= 9,complib='blosc') 
store = pd.HDFStore('hdfs3.h5', mode= 'r')
print (store.select('df','Item="Items0"'))
              Values
Item   N            
Items0 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00

Try replacing data_columns=True with data_columns=df.columns.tolist()

Demo:

Original MultiIndex DF:

In [2]: df
Out[2]:
              Values
Item   N
Items0 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00
Items1 Item0    0.25
       Item1    0.50
       Item2    0.75
       Item3    1.00

Save it to HDF5 using data_columns=df.columns.tolist():
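A minimal, self-contained sketch of this save step followed by a combined query. It assumes PyTables is installed; the temporary path and the file name `hdfs_demo.h5` are hypothetical, compression options are left out for brevity, and the `Values > 0.5` filter is chosen only for illustration.

```python
import os
import tempfile

import pandas as pd

# Rebuild the small MultiIndex frame from the question.
base = pd.DataFrame([0.25, 0.5, 0.75, 1.0],
                    index=['Item0', 'Item1', 'Item2', 'Item3'],
                    columns=['Values'])
df = pd.concat((base, base), axis=0, names=['Item', 'N'],
               keys=['Items0', 'Items1'])

# Hypothetical temporary location; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'hdfs_demo.h5')

# List the columns explicitly instead of passing data_columns=True.
df.to_hdf(path, key='df', format='table', mode='w',
          data_columns=df.columns.tolist())

with pd.HDFStore(path, mode='r') as store:
    # Filter on an index level and on a regular column in one query.
    result = store.select('df', where='Item == "Items0" and Values > 0.5')

print(result)
```

Opening the store in a `with` block closes the file handle even if the query raises.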

Select from the HDF store:

In [5]: store = pd.HDFStore('c:/temp/hdfs.h5')

Both the index levels and the Values column are now indexed and can be used in the where= parameter:

In [6]: store.select('df',where='Item="Items0" and Values in [0.5, 1]')
Out[6]:
              Values
Item   N
Items0 Item1     0.5
       Item3     1.0

In [7]: store.select('df',where='N="Item3" and Values in [0.5, 1]')
Out[7]:
              Values
Item   N
Items0 Item3     1.0
Items1 Item3     1.0

Storer info:

In [8]: store.get_storer('df').table
Out[8]:
/df/table (Table(8,), shuffle, blosc(9)) ''
  description := {
  "index": Int64Col(shape=(), dflt=0, pos=0),
  "N": StringCol(itemsize=5, shape=(), dflt=b'', pos=1),
  "Item": StringCol(itemsize=6, shape=(), dflt=b'', pos=2),
  "Values": Float64Col(shape=(), dflt=0.0, pos=3)}
  byteorder := 'little'
  chunkshape := (2427,)
  autoindex := True
  colindexes := {
    "Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "Item": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "N": Index(6, medium, shuffle, zlib(1)).is_csi=False}

Storer index levels:

In [9]: store.get_storer('df').levels
Out[9]: ['Item', 'N']

NOTE: if you simply omit the data_columns parameter, only the index levels are indexed in the HDF store, and all other columns are not searchable.

Demo:

In [19]: df.to_hdf('c:/temp/NO_data_columns.h5', 'df', format='t',mode='w',complevel=9,complib='blosc')

In [20]: store = pd.HDFStore('c:/temp/NO_data_columns.h5')

In [21]: store.select('df',where='N == "Item3"')
Out[21]:
              Values
Item   N
Items0 Item3     1.0
Items1 Item3     1.0

In [22]: store.select('df',where='N == "Item3" and Values == 1')
---------------------------------------------------------------------------
... skipped ...
ValueError: The passed where expression: N == "Item3" and Values == 1
            contains an invalid variable reference
            all of the variable refrences must be a reference to
            an axis (e.g. 'index' or 'columns'), or a data_column
            The currently defined references are: N,index,Item,columns

UPDATE: what is the real difference of putting data_columns=df.columns.tolist()?

In [18]: fn = r'd:/temp/a.h5'

In [19]: df.to_hdf(fn,'dc_true',data_columns=True,format='t',mode='w',complevel=9,complib='blosc')

In [20]: df.to_hdf(fn,'dc_cols',data_columns=df.columns.tolist(),format='t',complevel=9,complib='blosc')

In [21]: store = pd.HDFStore(fn)

In [22]: store
Out[22]:
<class 'pandas.io.pytables.HDFStore'>
File path: d:/temp/a.h5
/dc_cols            frame_table  (typ->appendable_multi,nrows->8,ncols->3,indexers->[index],dc->[N,Item,Values])
/dc_true            frame_table  (typ->appendable_multi,nrows->8,ncols->3,indexers->[index],dc->[Values])

In [23]: store.get_storer('dc_true').table.colindexes
Out[23]:
{
    "Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

In [24]: store.get_storer('dc_cols').table.colindexes
Out[24]:
{
    "Item": Index(6, medium, shuffle, zlib(1)).is_csi=False,  # <- missing when `data_columns=True`
    "N": Index(6, medium, shuffle, zlib(1)).is_csi=False,     # <- missing when `data_columns=True`
    "Values": Index(6, medium, shuffle, zlib(1)).is_csi=False,
    "index": Index(6, medium, shuffle, zlib(1)).is_csi=False}

Comments:

Thanks for the reply. It is strange, though: the official pandas documentation and many references say you have to use 'data_columns=True'. What could be the difference? Anyway, thanks.

Yes, I also tried to find it, without success. I have never used it with a MultiIndex, so it may not be supported, but I don't know.

Note that all columns other than the index names (Item, N) are not indexed this way and cannot be used in the where clause. For example, this query will not work: store.select('df',where='N="Item3" and Values in [0.5,1]')

A follow-up for anyone following this question: please take a look; hopefully it helps with attempts in this area.

Thanks for the detailed answer. What is the real difference of putting data_columns=df.columns.tolist()? I used a pandas Series table instead of a pandas DataFrame table, and it works fine without modifying any of the parameters I posted. Anyway, thanks; this kind of information is hard to find in the official documentation.

@Suraj, please see the UPDATE in my answer.

Great, things are clearer now. Thanks for your effort. To me, on-disk indexing with a MultiIndex HDFStore seems quite difficult.
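The contrast described in the NOTE can be reproduced with a short script. This is a sketch under assumptions rather than the answerer's own code: it assumes PyTables is installed, uses a hypothetical temporary file, and introduces an illustrative helper `try_select` that returns None when pandas rejects a query. The session above reports the rejection as a ValueError; the helper catches exceptions broadly in case the exception type differs between versions.

```python
import os
import tempfile

import pandas as pd

base = pd.DataFrame([0.25, 0.5, 0.75, 1.0],
                    index=['Item0', 'Item1', 'Item2', 'Item3'],
                    columns=['Values'])
df = pd.concat((base, base), axis=0, names=['Item', 'N'],
               keys=['Items0', 'Items1'])

# Hypothetical temporary file; written WITHOUT data_columns.
path = os.path.join(tempfile.mkdtemp(), 'no_data_columns_demo.h5')
df.to_hdf(path, key='df', format='table', mode='w')


def try_select(store, where):
    """Return the selected frame, or None if the query is rejected."""
    try:
        return store.select('df', where=where)
    except Exception:  # reported as ValueError in the session above
        return None


with pd.HDFStore(path, mode='r') as store:
    # Index levels are queryable even without data_columns.
    by_level = try_select(store, 'N == "Item3"')
    # 'Values' was not stored as a data column, so this should be rejected.
    by_value = try_select(store, 'N == "Item3" and Values == 1')

print(by_level)
print(by_value)
```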