Python 2.7: how to quickly select DataFrame rows between two dates
I would like to know which method is the most efficient, in terms of speed, for selecting rows between two dates in the index. For example:
>>> import numpy as np
>>> import pandas as pd
>>> index = pd.date_range('2018-01-01', '2030-01-02', freq='BM')
>>> df = pd.DataFrame(np.zeros((len(index), 1)), index=index)
>>> df.head()
0
2018-01-31 0.0
2018-02-28 0.0
2018-03-30 0.0
2018-04-30 0.0
2018-05-31 0.0
Then, one way to select all rows between, for example, 2018-05-30 and 2027-07-03 is:
>>> df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')]
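For reference, here is a self-contained sketch of the same selection with the bounds held in variables, as they would be if the dates change on every call. The names start, end, mask and selected are illustrative and not part of the original snippet, and pd.offsets.BMonthEnd() is passed in place of the 'BM' alias on the assumption that the alias may be deprecated in newer pandas releases.

```python
import numpy as np
import pandas as pd

# Same frame as above: one row per business month end.
index = pd.date_range('2018-01-01', '2030-01-02', freq=pd.offsets.BMonthEnd())
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)

# Bounds kept in variables (illustrative names) so they can change per call.
start = pd.Timestamp('2018-05-30')
end = pd.Timestamp('2027-07-03')

# Each comparison yields a boolean array; & combines them element-wise.
mask = (df.index >= start) & (df.index <= end)
selected = df.loc[mask]

print(selected.head())
```

Both bounds are inclusive here; changing <= to < makes the upper bound exclusive.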
Your original method appears to be the faster of the two options (boolean mask versus label slice):
Lookup using "&":
In[]: %timeit -r 5 -n 10 df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')]
Out[]: 10 loops, best of 5: 501 µs per loop
In[]: %timeit -r 5 -n 10 df.loc['2018-05-30':'2027-07-03']
Out[]: 10 loops, best of 5: 724 µs per loop
So you are already using an optimized operation.
Edit: added another, slower operation to show that this one is already fast:
In[]: %timeit -r 5 -n 10 df[df.index.isin(pd.date_range("2018-05-30", "2027-07-03").values)]
Out[]: 10 loops, best of 5: 1.13 ms per loop
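Outside IPython, roughly the same comparison can be run with the standard timeit module. The following is only a sketch: the helper best_time and the candidates dictionary are illustrative names, and the absolute numbers will depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

index = pd.date_range('2018-01-01', '2030-01-02', freq=pd.offsets.BMonthEnd())
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)

# The three lookups timed above, wrapped as zero-argument callables.
candidates = {
    'mask': lambda: df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')],
    'slice': lambda: df.loc['2018-05-30':'2027-07-03'],
    'isin': lambda: df[df.index.isin(pd.date_range('2018-05-30', '2027-07-03').values)],
}


def best_time(func, repeat=5, number=10):
    """Best per-call time in seconds, mirroring %timeit -r 5 -n 10."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


results = {name: best_time(func) for name, func in candidates.items()}
for name in sorted(results, key=results.get):
    print('%-6s %.1f us per call' % (name, results[name] * 1e6))
```

Taking the minimum over the repeats follows the "best of N" convention that %timeit reports in the outputs above.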
Do you think print (df.loc['2018-05-30':'2027-07-03']) is the fastest way? How would you want to select if the dates are not known beforehand?
@jezrael "Not known" was ambiguous. I mean that the values are not constant, i.e. they change dynamically, as if they were randomly generated each time.
Thanks for your answer. Do you know whether there is a faster method? Perhaps one that does not use the index at all and converts it to another format?
Nice suggestion with truncate; I had never seen this method before. Very handy.

You can use:
print (df.truncate(before='2018-05-30', after='2027-07-03'))
print (df.loc['2018-05-30':'2027-07-03'])
print (df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')])
In [366]: %timeit (df.loc['2018-05-30':'2027-07-03'])
The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.43 ms per loop
In [367]: %timeit (df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')])
The slowest run took 4.97 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 502 µs per loop
In [368]: %timeit (df.truncate(before='2018-05-30', after='2027-07-03'))
The slowest run took 4.98 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 450 µs per loop
In [372]: %timeit (df.loc[(df.index >= '2018-05-31') & (df.index < '2027-05-31')])
The slowest run took 4.81 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 520 µs per loop
In [373]: %timeit (df.iloc[df.index.searchsorted('2018-05-31'): df.index.searchsorted('2027-05-31')])
10000 loops, best of 3: 136 µs per loop
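The searchsorted timing above uses the default side='left' for both bounds, so it selects the half-open range [start, end), matching the ">= ... <" mask timed just before it. As a sketch of how the same idea could cover the inclusive range from the question, the hypothetical helper select_between below uses side='right' for the upper bound. It assumes the index is a sorted DatetimeIndex; on an unsorted index the positions returned by searchsorted are not meaningful.

```python
import numpy as np
import pandas as pd


def select_between(frame, start, end):
    """Rows whose index lies in [start, end]; assumes a sorted DatetimeIndex."""
    start, end = pd.Timestamp(start), pd.Timestamp(end)
    lo = int(frame.index.searchsorted(start, side='left'))   # first position >= start
    hi = int(frame.index.searchsorted(end, side='right'))    # first position > end
    return frame.iloc[lo:hi]


index = pd.date_range('2018-01-01', '2030-01-02', freq=pd.offsets.BMonthEnd())
df = pd.DataFrame(np.zeros((len(index), 1)), index=index)

by_position = select_between(df, '2018-05-30', '2027-07-03')
by_mask = df.loc[(df.index >= '2018-05-30') & (df.index <= '2027-07-03')]
by_truncate = df.truncate(before='2018-05-30', after='2027-07-03')

# Check that the positional lookup returns the same rows as the other methods.
print(by_position.equals(by_mask), by_position.equals(by_truncate))
```

Because the bounds are converted with pd.Timestamp inside the helper, it also accepts dates that are only known at run time. The binary search avoids comparing every index entry, which is a plausible reason for the lower timing above.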