Python: Most efficient way to split a DataFrame using multiple DatetimeIndexes

python pandas

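I have a dataframe which contains the prices of a security for every minute over a long period of time.

I would like to extract a subset of the prices: one set per day, between specific times.

Here is a brute-force example (using hourly data for brevity):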
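
The snippets in this question assume that a prices frame already exists. A minimal stand-in might be built as follows; the random values, the three-day span and the hourly resolution are assumptions made purely for illustration, not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: one random price per hour
# over three days (the real frame is per minute and far longer).
rng = np.random.default_rng(0)
index = pd.date_range("2018-01-01", "2018-01-03 23:00", freq="h")
prices = pd.DataFrame({"price": rng.random(len(index))}, index=index)
```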

I now have a DatetimeIndex for the hours I want to extract on each day:

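import datetime
import pandas

# Hours to extract on day 1: 08:00 through 17:00
# (freq='h' is the hourly alias; older pandas code spells it 'H')
start = datetime.datetime(2018,1,1,8)
end   = datetime.datetime(2018,1,1,17)
day1  = pandas.date_range(start, end, freq='h')

# Hours to extract on day 2: 09:00 through 13:00
start = datetime.datetime(2018,1,2,9)
end   = datetime.datetime(2018,1,2,13)
day2  = pandas.date_range(start, end, freq='h')

days = [ day1, day2 ]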

I can then use prices.index.isin with each of my DatetimeIndexes to extract the relevant day's prices:

daily_prices = [ prices[prices.index.isin(d)] for d in days]

This works as expected:

daily_prices[0]

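The problem is that as the length of each selection DatetimeIndex grows, and as the number of days I want to extract grows, my list comprehension slows to a crawl.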

Since I know that each selection DatetimeIndex fully covers the hours it contains, I tried using loc with the first and last element of each index in my list:

daily_prices = [ prices.loc[d[0]:d[-1]] for d in days]
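
As a sanity check, the two selections can be compared on a small example. This is only a sketch: the prices frame below is a hypothetical hourly stand-in, and the comparison holds because each selection index is contiguous at the same frequency as the data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly prices standing in for the real frame.
index = pd.date_range("2018-01-01", "2018-01-03 23:00", freq="h")
prices = pd.DataFrame({"price": np.arange(len(index), dtype=float)}, index=index)

days = [
    pd.date_range("2018-01-01 08:00", "2018-01-01 17:00", freq="h"),
    pd.date_range("2018-01-02 09:00", "2018-01-02 13:00", freq="h"),
]

# Membership test against every timestamp in each selection index.
by_isin = [prices[prices.index.isin(d)] for d in days]

# Label-based slice; .loc slices include both end points.
by_slice = [prices.loc[d[0]:d[-1]] for d in days]

same = all(a.equals(b) for a, b in zip(by_isin, by_slice))
```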

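While this is a bit faster, it is still exceptionally slow when the number of days is very large.

Is there a more efficient way to divide a dataframe into start and end time ranges like the ones above?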

If the hours are consistent from day to day, you can just filter on the index, which should be very fast:

In [5]: prices.loc[prices.index.hour.isin(range(8,18))]
Out[5]:
                        price
2018-01-01 08:00:00  0.638051
2018-01-01 09:00:00  0.059258
2018-01-01 10:00:00  0.869144
2018-01-01 11:00:00  0.443970
2018-01-01 12:00:00  0.725146
2018-01-01 13:00:00  0.309600
2018-01-01 14:00:00  0.520718
2018-01-01 15:00:00  0.976284
2018-01-01 16:00:00  0.973313
2018-01-01 17:00:00  0.158488
2018-01-02 08:00:00  0.053680
2018-01-02 09:00:00  0.280477
2018-01-02 10:00:00  0.802826
2018-01-02 11:00:00  0.379837
2018-01-02 12:00:00  0.247583
....

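Edit: In response to your comment, working directly on the index and then doing a single lookup at the end is probably still the fastest approach, even if the hours are not consistent from day to day. Getting single-day frames at the end is easy with groupby.

For example: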

df = prices.loc[[i for i in prices.index if (i.hour in range(8, 18) and i.day in range(1,10)) or (i.hour in range(2,4) and i.day in range(11,32))]] 
framelist = [frame for _, frame in df.groupby(df.index.date)]

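will give you a list of dataframes with one day per list element. It includes hours 8 through 17 on days 1-9 of each month and hours 2 through 3 on days 11-31.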
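
The list comprehension above visits every timestamp in Python. As a sketch of an alternative, the same conditions can be written as one boolean mask over the index attributes; the prices frame here is a hypothetical hourly stand-in covering January 2018, and no timing claim is made for real data.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly prices covering January 2018.
index = pd.date_range("2018-01-01", "2018-01-31 23:00", freq="h")
prices = pd.DataFrame({"price": np.arange(len(index), dtype=float)}, index=index)

hour = prices.index.hour
day = prices.index.day

# Same conditions as the list comprehension, evaluated on whole arrays:
# hours 8-17 on days 1-9, or hours 2-3 on days 11-31.
mask = (
    ((hour >= 8) & (hour < 18) & (day >= 1) & (day < 10))
    | ((hour >= 2) & (hour < 4) & (day >= 11) & (day < 32))
)

df = prices.loc[mask]  # single lookup on the full frame
framelist = [frame for _, frame in df.groupby(df.index.date)]
```

Each element of framelist holds one calendar day, in date order, exactly as in the answer's version.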

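Thanks for the help, but unfortunately the hours are not consistent from day to day. I will update the question to reflect that. Also, I would like multiple dataframes returned, one per day.

I think you are wasting a lot of time subsetting prices, which may be very large, a very large number of times. Since you need to specify separate times for each day, perhaps groupby first to get a DataFrame for each day, then mask the smaller DataFrames accordingly. Or, if you know you only have 10-15 unique ranges (say 8am-10pm, 7am-1pm, 7am-5pm), you can just subset for that small set of time ranges and then select the correct dates.

@Alollz Perhaps I should have mentioned this in the question, but my goal is to expose this to a C++ library via pybind11. So I want to use pandas to do all the data slicing, and then expose the underlying NumPy data arrays to C++.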
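
The groupby-first idea from the comments could look roughly like the sketch below. The prices frame, the windows mapping and all variable names are hypothetical, and whether this beats slicing the full frame depends on the data, so it is worth timing before adopting.

```python
import datetime

import numpy as np
import pandas as pd

# Hypothetical hourly prices and per-day windows; names are illustrative.
index = pd.date_range("2018-01-01", "2018-01-03 23:00", freq="h")
prices = pd.DataFrame({"price": np.arange(len(index), dtype=float)}, index=index)

windows = {
    datetime.date(2018, 1, 1): (pd.Timestamp("2018-01-01 08:00"),
                                pd.Timestamp("2018-01-01 17:00")),
    datetime.date(2018, 1, 2): (pd.Timestamp("2018-01-02 09:00"),
                                pd.Timestamp("2018-01-02 13:00")),
}

# Split the big frame once by calendar date ...
per_day = {date: frame for date, frame in prices.groupby(prices.index.date)}

# ... then slice only the small per-day frames (both end points included).
daily_prices = [
    per_day[date].loc[start:end] for date, (start, end) in windows.items()
]

# Plain NumPy arrays, e.g. for handing over to a C++ extension.
daily_arrays = [frame["price"].to_numpy() for frame in daily_prices]
```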
