HDFStore.select an order of magnitude slower than DataFrame slicing?

Tags: python, pandas, hdfstore, hdf

Given a simple DataFrame with an integer index and float columns, this code:

import pandas as pd
store = pd.HDFStore('test.hdf5')
print(store.select('df', where='index >= 50000')['A'].mean())

is at least 10 times slower than this code:

store = pd.HDFStore('test.hdf5')
print(store.get('df')['A'][50000:].mean())

Whether the store uses the table or fixed format makes little difference; the select() call, although equivalent to the slice, is much slower.

Thanks for any insight.

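You cannot select with a where clause if the format is 'fixed'; doing so raises an exception (access times for the fixed format are in fact much faster). That said, you can index a fixed format directly by position.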
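To make the fixed-format behaviour concrete, here is a minimal sketch. The helper name compare_fixed_access, the scratch file name and the column labels are all made up for illustration, and the exception type caught is what recent pandas versions appear to raise, so treat it as an assumption rather than a guarantee.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def compare_fixed_access(n_rows=1000, start=500):
    """Write a small frame in fixed format, then try both access styles.

    Hypothetical helper for illustration; requires PyTables to be installed.
    """
    df = pd.DataFrame(np.random.randn(n_rows, 3), columns=list("ABC"))
    path = os.path.join(tempfile.mkdtemp(), "fixed_demo.h5")  # scratch file
    df.to_hdf(path, key="df", mode="w", format="fixed")

    # A where clause needs the table format; on a fixed store pandas raises
    # (a TypeError in the versions I have seen, which is an assumption here).
    try:
        pd.read_hdf(path, "df", where="index >= %d" % start)
        where_failed = False
    except (TypeError, ValueError):
        where_failed = True

    # Positional access through start/stop still works on a fixed store.
    tail = pd.read_hdf(path, "df", start=start)
    return where_failed, tail, df
```

Calling compare_fixed_access() should report that the where-based read failed while the positional read returned the last rows of the frame.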

In [39]: df = pd.DataFrame(np.random.randn(1000000,10))

In [40]: df.to_hdf('test.h5',key='df',mode='w',format='table')

In [41]: def f():
    df = pd.read_hdf('test.h5','df')
    return df.loc[50001:,0]
   ....: 

In [42]: def g():
    df = pd.read_hdf('test.h5','df')
    return df.loc[df.index>50000,0]
   ....: 

In [43]: def h():
    return pd.read_hdf('test.h5','df',where='index>50000')[0]
   ....: 

In [44]: f().equals(g())
Out[44]: True

In [46]: f().equals(h())
Out[46]: True

In [47]: %timeit f()
10 loops, best of 3: 159 ms per loop

In [48]: %timeit g()
10 loops, best of 3: 127 ms per loop

In [49]: %timeit h()
1 loops, best of 3: 499 ms per loop
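So yes, it is somewhat slower. But you are doing a lot more work: the where clause compares a boolean indexer against the entire array. Reading the whole frame in at once has many advantages (e.g. caching, locality).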
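The extra work can be seen in a purely in-memory analogy. This sketch only shows the difference between a boolean mask and a label slice on a DataFrame; it is not a description of what PyTables does internally, and the frame size is arbitrary.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100_000, 10))

# Boolean indexing: every index value is compared, a full-length mask is
# built, and the matching rows are then gathered.
mask = df.index > 50_000
by_mask = df.loc[mask, 0]

# Label slice on a sorted integer index: only the endpoints are located.
by_slice = df.loc[50_001:, 0]

print(mask.dtype, mask.shape)
print(len(by_mask), len(by_slice))
```

Both selections return the same rows, but the mask version has to touch all 100,000 index entries to get there.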

Of course, if you are simply selecting a contiguous slice, then just do this:

In [59]: def i():
    return pd.read_hdf('test.h5','df',start=50001)[0]
   ....: 

In [60]: i().equals(h())
Out[60]: True

In [61]: %timeit i()
10 loops, best of 3: 86.6 ms per loop

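Makes sense, thanks. But if my DataFrame is too large to fit in memory, I have no choice but to fall back to select on a table, right? Or is indexing on a fixed format still an option?

You can index on a fixed format, but only by position; to do an actual selection you need the table format.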
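For the larger-than-memory case raised in the comment, one possible approach is to let select hand back the matching rows in chunks and accumulate the statistic incrementally. The sketch below assumes the data was written with format='table'; the function name chunked_mean and its parameters are made up for illustration.

```python
import pandas as pd


def chunked_mean(path, key, column, where, chunksize=100_000):
    """Mean of one column over the rows matching `where`, read chunk by chunk.

    Hypothetical helper; requires PyTables and a table-format store.
    """
    total, count = 0.0, 0
    with pd.HDFStore(path, mode="r") as store:
        # With chunksize set, select returns an iterator of DataFrames
        # instead of materialising the whole selection at once.
        for chunk in store.select(key, where=where, columns=[column],
                                  chunksize=chunksize):
            total += chunk[column].sum()
            count += len(chunk)
    return total / count if count else float("nan")
```

With the file from the question, a call such as chunked_mean('test.hdf5', 'df', 'A', 'index >= 50000') would compute the same mean without holding the full selection in memory.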