Fast selection of a Timestamp range in hierarchically indexed data in Python

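With a DataFrame that has a tz-aware DatetimeIndex, the following is a fast way to select multiple rows between two dates, for a left-inclusive, right-exclusive interval: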

import pandas as pd
start_ts = pd.Timestamp('20000101 12:00 UTC')
end_ts = pd.Timestamp('20000102 12:00 UTC')
ix_df = pd.DataFrame(0, index=[pd.Timestamp('20000101 00:00 UTC'), pd.Timestamp('20000102 00:00 UTC')], columns=['a'])
EPSILON_TIME = pd.tseries.offsets.Nano()
ix_df[start_ts:end_ts-EPSILON_TIME]
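The solution above is quite efficient, because we neither create a temporary index iterable (as I will do later) nor run a lambda expression in Python to build the new DataFrame. In fact, I believe the selection is around O(log(N)) at most. I would like to know whether the same can be done on a particular axis (level) of a MultiIndex, or whether I have to create a temporary iterable or run a lambda expression. For example: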

mux = pd.MultiIndex.from_arrays([[pd.Timestamp('20000102 00:00 UTC'), pd.Timestamp('20000103 00:00 UTC')], [pd.Timestamp('20000101 00:00 UTC'), pd.Timestamp('20000102 00:00 UTC')]])
mux_df = pd.DataFrame(0, index=mux, columns=['a'])
Then I can select on the first (zeroth) level of the index in the same way:

mux_df[start_ts:end_ts-EPSILON_TIME]
This yields:

                                                     a
2000-01-02 00:00:00+00:00 2000-01-01 00:00:00+00:00  0
But for the second level, I have to fall back to a slow solution:

values_itr = mux_df.index.get_level_values(1)
mask_ser = (values_itr >= start_ts) & (values_itr < end_ts)
mux_df[mask_ser]
Is there a fast solution? Thanks.

Edit: the chosen approach

In the end, once I realized that I also needed a slice, I settled on this solution:

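import numpy as np

def view(data_df):
    # start_ts and end_ts are the bounds defined earlier in the question
    if len(data_df.index) == 0:
        return data_df
    # Level 0 must be sorted for the binary search below to be valid
    values_itr = data_df.index.get_level_values(0)
    values_itr = values_itr.values
    # side='left' on both bounds gives a left-inclusive, right-exclusive range
    from_i = np.searchsorted(values_itr, np.datetime64(start_ts), side='left')
    to_i = np.searchsorted(values_itr, np.datetime64(end_ts), side='left')
    # Positional slice; .ix has been removed from pandas, .iloc is the positional equivalent
    return data_df.iloc[from_i:to_i]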

And then call view(data_df).copy(). Note: the values in the first level of my index are in fact sorted.
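The binary-search idea in the edit above can be sketched in a self-contained form. This is only an illustration under assumptions: the helper name slice_sorted_level0 and the demo data are hypothetical, the bounds are passed as arguments rather than read from outer variables, and level 0 of the index is assumed to be sorted ascending.

```python
import pandas as pd


def slice_sorted_level0(data_df, start_ts, end_ts):
    """Return rows whose level-0 timestamp lies in [start_ts, end_ts).

    Hypothetical helper; assumes level 0 of the index is sorted ascending.
    """
    level0 = data_df.index.get_level_values(0)
    # Index.searchsorted performs a binary search; side='left' on both
    # bounds yields a left-inclusive, right-exclusive positional range.
    from_i = level0.searchsorted(start_ts, side='left')
    to_i = level0.searchsorted(end_ts, side='left')
    # Positional slicing on the frame
    return data_df.iloc[from_i:to_i]


# Small demo frame with two tz-aware timestamp levels (made-up data)
level0 = pd.date_range('2000-01-01', periods=4, freq='D', tz='UTC')
level1 = pd.date_range('2000-02-01', periods=4, freq='D', tz='UTC')
demo_df = pd.DataFrame(
    {'a': range(4)},
    index=pd.MultiIndex.from_arrays([level0, level1]),
)

selected = slice_sorted_level0(
    demo_df,
    pd.Timestamp('2000-01-02', tz='UTC'),
    pd.Timestamp('2000-01-04', tz='UTC'),
)
print(selected)
```

Because both bounds use side='left', a row exactly at end_ts is excluded, which matches the left-inclusive, right-exclusive interval asked for in the question.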

You are effectively comparing apples to oranges

In [59]: N = 1000000

In [60]: pd.set_option('max_rows',10)

In [61]: idx = pd.IndexSlice

In [62]: df = DataFrame(np.arange(N).reshape(-1,1),columns=['value'],index=pd.MultiIndex.from_product([list('abcdefghij'),date_range('20010101',periods=N/10,freq='T',tz='US/Eastern')],names=['one','two']))

In [63]: df
Out[63]: 
                                value
one two                              
a   2001-01-01 00:00:00-05:00       0
    2001-01-01 00:01:00-05:00       1
    2001-01-01 00:02:00-05:00       2
    2001-01-01 00:03:00-05:00       3
    2001-01-01 00:04:00-05:00       4
...                               ...
j   2001-03-11 10:35:00-05:00  999995
    2001-03-11 10:36:00-05:00  999996
    2001-03-11 10:37:00-05:00  999997
    2001-03-11 10:38:00-05:00  999998
    2001-03-11 10:39:00-05:00  999999

[1000000 rows x 1 columns]

In [64]: df2 = df.reset_index(level='one').sort_index()

In [65]: df2
Out[65]: 
                          one   value
two                                  
2001-01-01 00:00:00-05:00   a       0
2001-01-01 00:00:00-05:00   i  800000
2001-01-01 00:00:00-05:00   h  700000
2001-01-01 00:00:00-05:00   g  600000
2001-01-01 00:00:00-05:00   f  500000
...                        ..     ...
2001-03-11 10:39:00-05:00   c  299999
2001-03-11 10:39:00-05:00   b  199999
2001-03-11 10:39:00-05:00   a   99999
2001-03-11 10:39:00-05:00   i  899999
2001-03-11 10:39:00-05:00   j  999999

[1000000 rows x 2 columns]
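When I reset the index (creating a single-level index), it is no longer unique. This makes a big difference, because the lookup works differently. So you cannot really compare indexing on a single-level unique index with indexing on a MultiIndex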
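The loss of uniqueness described here can be checked on a much smaller frame. The following is a minimal sketch with made-up data (the names times, small_df and flat_df are illustrative only), using the Index.is_unique and Index.is_monotonic_increasing properties:

```python
import numpy as np
import pandas as pd

# Two groups sharing the same three timestamps (illustrative data)
times = pd.date_range('2001-01-01', periods=3, freq='min', tz='UTC')
mi = pd.MultiIndex.from_product([list('ab'), times], names=['one', 'two'])
small_df = pd.DataFrame({'value': np.arange(6)}, index=mi)

# Moving level 'one' into a column leaves only the timestamps in the index,
# so every timestamp now appears once per group.
flat_df = small_df.reset_index(level='one').sort_index()

print(small_df.index.is_unique)                # the (one, two) pairs are unique
print(flat_df.index.is_unique)                 # the timestamps alone repeat
print(flat_df.index.is_monotonic_increasing)   # still sorted after sort_index()
```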

The upshot is to use MultiIndex slicers (introduced in pandas 0.14.0), which make indexing on any level quite fast

In [66]: %timeit df.loc[idx[:,'20010201':'20010301'],:]
1 loops, best of 3: 188 ms per loop

In [67]: df.loc[idx[:,'20010201':'20010301'],:]
Out[67]: 
                                value
one two                              
a   2001-02-01 00:00:00-05:00   44640
    2001-02-01 00:01:00-05:00   44641
    2001-02-01 00:02:00-05:00   44642
    2001-02-01 00:03:00-05:00   44643
    2001-02-01 00:04:00-05:00   44644
...                               ...
j   2001-03-01 23:55:00-05:00  986395
    2001-03-01 23:56:00-05:00  986396
    2001-03-01 23:57:00-05:00  986397
    2001-03-01 23:58:00-05:00  986398
    2001-03-01 23:59:00-05:00  986399

[417600 rows x 1 columns]
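The slicer above uses date strings, which include the whole end date. For the left-inclusive, right-exclusive interval from the question, the same slicer can presumably be given Timestamp bounds, with one nanosecond subtracted from the end. The sketch below uses a small made-up frame and assumes the MultiIndex is sorted, as it is when built with from_product from sorted inputs:

```python
import numpy as np
import pandas as pd

idx = pd.IndexSlice

# Small sorted MultiIndex: two groups times five daily timestamps (illustrative)
times = pd.date_range('2001-01-01', periods=5, freq='D', tz='UTC')
mi = pd.MultiIndex.from_product([list('ab'), times], names=['one', 'two'])
small_df = pd.DataFrame({'value': np.arange(10)}, index=mi)

start_ts = pd.Timestamp('2001-01-02', tz='UTC')
end_ts = pd.Timestamp('2001-01-04', tz='UTC')
one_ns = pd.Timedelta(1, unit='ns')

# Label slices are inclusive on both ends, so subtracting one nanosecond
# from the end bound makes the selection right-exclusive.
subset = small_df.loc[idx[:, start_ts:end_ts - one_ns], :]
print(subset)
```

The rows stamped exactly at end_ts are left out for every value of the first level, while rows at start_ts are kept.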
Compare this to a non-unique single level

In [68]: %timeit df2.loc['20010201':'20010301']
1 loops, best of 3: 470 ms per loop
Here is a unique single level

In [73]: df3 = DataFrame(np.arange(N).reshape(-1,1),columns=['value'],index=date_range('20010101',periods=N,freq='T',tz='US/Eastern'))

In [74]: df3
Out[74]: 
                            value
2001-01-01 00:00:00-05:00       0
2001-01-01 00:01:00-05:00       1
2001-01-01 00:02:00-05:00       2
2001-01-01 00:03:00-05:00       3
2001-01-01 00:04:00-05:00       4
...                           ...
2002-11-26 10:35:00-05:00  999995
2002-11-26 10:36:00-05:00  999996
2002-11-26 10:37:00-05:00  999997
2002-11-26 10:38:00-05:00  999998
2002-11-26 10:39:00-05:00  999999

[1000000 rows x 1 columns]

In [75]: df3.loc['20010201':'20010301']
Out[75]: 
                           value
2001-02-01 00:00:00-05:00  44640
2001-02-01 00:01:00-05:00  44641
2001-02-01 00:02:00-05:00  44642
2001-02-01 00:03:00-05:00  44643
2001-02-01 00:04:00-05:00  44644
...                          ...
2001-03-01 23:55:00-05:00  86395
2001-03-01 23:56:00-05:00  86396
2001-03-01 23:57:00-05:00  86397
2001-03-01 23:58:00-05:00  86398
2001-03-01 23:59:00-05:00  86399

[41760 rows x 1 columns]
The fastest so far

In [76]: %timeit df3.loc['20010201':'20010301']
1 loops, best of 3: 294 ms per loop
Best is a single-level unique index without a timezone

In [77]: df3 = DataFrame(np.arange(N).reshape(-1,1),columns=['value'],index=date_range('20010101',periods=N,freq='T'))

In [78]: %timeit df3.loc['20010201':'20010301']
1 loops, best of 3: 240 ms per loop
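This is by far the fastest method (I am doing a slightly different search here to get the same results, since the semantics of the searches above include every timestamp on the specified dates)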

In [101]: df4 = df3.reset_index()

In [103]: %timeit df4.loc[(df4['index']>='20010201') & (df4['index']<...)]

Thanks Jeff! I tried the 4th one, then realized I also need a sliced view, so I ended up with the code above.
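The remark in the answer about slice semantics, namely that a date-string slice includes every timestamp on the end date, can be checked with a small sketch. The data and variable names here are made up, and the boolean mask uses explicit Timestamp bounds on the column produced by reset_index() (named 'index' because the original index has no name):

```python
import numpy as np
import pandas as pd

# Three days of minute data on a tz-naive, unique DatetimeIndex (illustrative)
minutes = pd.date_range('2001-01-01', periods=3 * 24 * 60, freq='min')
ts_df = pd.DataFrame({'value': np.arange(len(minutes))}, index=minutes)

# A date-string slice covers the whole of the end date, not just midnight
by_label = ts_df.loc['2001-01-02':'2001-01-02']

# Equivalent half-open boolean mask on a plain column
flat = ts_df.reset_index()
mask = (flat['index'] >= pd.Timestamp('2001-01-02')) & (
    flat['index'] < pd.Timestamp('2001-01-03')
)
by_mask = flat.loc[mask]

print(len(by_label), len(by_mask))
```

Both selections cover the same 1440 minutes of 2 January, which is why the upper bound of the mask has to be the start of the following day.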