Python pandas: Resampling using an external Series

python pandas

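I would like to resample a DataFrame that contains intraday data on market volume and market price, using an external Series of datetimes. An example of the DataFrame, named df, is shown below:
(Edit: corrected an error in the example dataset.)
The Series is called beginpoints (it holds the start of each interval) and looks like: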
0      2013-04-15 21:45:00
1      2013-04-15 22:04:00
2      2013-04-15 22:13:00

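Since I am interested in the sum of the volume and the opening price of each interval, I would ultimately like to end up with the following result: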
Datetime                Volume     Price 
2013-04-15 21:45:00     144        50.00
2013-04-15 22:04:00     370        50.38
2013-04-15 22:13:00     64         50.02

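I know that standard resampling looks like df.resample('5min', how={'Volume': 'sum', 'Price': 'first'}), for example to aggregate every 5 minutes. However, when I try to adapt this to my scenario and use df.resample(beginpoints, how={'Volume': 'sum', 'Price': 'first'}), I get a ValueError. This seems simple, but I cannot figure out what I am doing wrong. Does anyone know how to solve this? Thanks.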
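For reference, recent pandas releases no longer accept the how= keyword on resample; the aggregation is chained with .agg instead. A minimal sketch of the fixed 5-minute case follows. The sample rows are an assumption, reconstructed from the tables shown later in this thread, and the name five_min is only illustrative.

```python
import pandas as pd

# Sample rows reconstructed from the tables in this thread (an assumption).
df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2013-04-15 21:45:00", "2013-04-15 21:47:00", "2013-04-15 21:52:00",
        "2013-04-15 22:03:00", "2013-04-15 22:04:00", "2013-04-15 22:07:00",
        "2013-04-15 22:12:00", "2013-04-15 22:13:00", "2013-04-15 22:19:00",
    ]),
    "Volume": [100, 25, 15, 4, 145, 68, 157, 27, 37],
    "Price": [50.00, 50.03, 50.05, 50.07, 50.38, 50.04, 49.93, 50.02, 49.91],
})

# Fixed 5-minute bins: sum the volume and take the first price in each bin.
# resample needs a datetime-like index, hence the set_index call.
five_min = (
    df.set_index("Datetime")
      .resample("5min")
      .agg({"Volume": "sum", "Price": "first"})
)
print(five_min)
```

Note that resample only builds regularly spaced bins from a frequency rule, which is why irregular start points such as beginpoints need a different approach.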
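Here is one approach. First, I reset the index of the beginpoints Series and set beginpoints as the index. Then I extract the index column as a Series and use it to map the Datetime column in df. Some Datetime values are not part of beginpoints, so the corresponding point values are N/A. But since Datetime is sorted, we can use ffill to fill those N/A values: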
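import pandas as pd

# Start of each interval
s = pd.Series(["2013-04-15 21:45:00", "2013-04-15 22:04:00", "2013-04-15 22:13:00"], name="beginpoints")

# Build a lookup Series: beginpoint -> position (0, 1, 2)
t = s.reset_index().set_index("beginpoints")
ts = t['index']

# Rows that are not a beginpoint map to NaN; forward-fill them from the preceding beginpoint
df['point'] = df['Datetime'].map(ts).ffill()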

The result is:
              Datetime  Volume  Price  point
0  2013-04-15 21:45:00     100  50.00      0
1  2013-04-15 21:47:00      25  50.03      0
2  2013-04-15 21:52:00      15  50.05      0
3  2013-04-15 22:03:00       4  50.07      0
4  2013-04-15 22:04:00     145  50.38      1
5  2013-04-15 22:07:00      68  50.04      1
6  2013-04-15 22:12:00     157  49.93      1
7  2013-04-15 22:13:00      27  50.02      2
8  2013-04-15 22:19:00      37  49.91      2
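As an alternative way to produce the same point labels, one could look up each timestamp's position among the sorted start points with NumPy's searchsorted. This is a swapped-in technique, not the approach used in the answer, and the variable names below are only illustrative. It assumes both series hold real datetimes and that beginpoints is sorted.

```python
import numpy as np
import pandas as pd

# Interval start points and row timestamps, taken from the tables in this thread.
beginpoints = pd.to_datetime(pd.Series(
    ["2013-04-15 21:45:00", "2013-04-15 22:04:00", "2013-04-15 22:13:00"],
    name="beginpoints",
))
datetimes = pd.to_datetime(pd.Series([
    "2013-04-15 21:45:00", "2013-04-15 21:47:00", "2013-04-15 21:52:00",
    "2013-04-15 22:03:00", "2013-04-15 22:04:00", "2013-04-15 22:07:00",
    "2013-04-15 22:12:00", "2013-04-15 22:13:00", "2013-04-15 22:19:00",
], name="Datetime"))

# side="right" counts the start points that are <= each timestamp;
# subtracting 1 turns that count into the interval number.
point = np.searchsorted(beginpoints.to_numpy(), datetimes.to_numpy(), side="right") - 1
print(point.tolist())
```

A timestamp earlier than the first start point would get the label -1 here, so such rows would need to be filtered out before grouping.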

Finally, just use groupby to compute the aggregates for each point:
# Group the rows by the interval they belong to
group = df.groupby('point')

# Per interval: its first timestamp, the total volume, and the opening price
df2 = group.agg({'Datetime': 'first', 'Volume': 'sum', 'Price': 'first'})

The result is:
              Datetime  Volume  Price
0  2013-04-15 21:45:00     144  50.00
1  2013-04-15 22:04:00     370  50.38
2  2013-04-15 22:13:00      64  50.02
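Putting the steps together, here is a self-contained sketch of the approach above. The input rows are an assumption, reconstructed from the tables in this answer; the Datetime column is converted to real datetimes rather than strings, and the names lookup and result are only illustrative.

```python
import pandas as pd

# Sample rows reconstructed from the tables above (an assumption).
df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2013-04-15 21:45:00", "2013-04-15 21:47:00", "2013-04-15 21:52:00",
        "2013-04-15 22:03:00", "2013-04-15 22:04:00", "2013-04-15 22:07:00",
        "2013-04-15 22:12:00", "2013-04-15 22:13:00", "2013-04-15 22:19:00",
    ]),
    "Volume": [100, 25, 15, 4, 145, 68, 157, 27, 37],
    "Price": [50.00, 50.03, 50.05, 50.07, 50.38, 50.04, 49.93, 50.02, 49.91],
})
beginpoints = pd.to_datetime(pd.Series(
    ["2013-04-15 21:45:00", "2013-04-15 22:04:00", "2013-04-15 22:13:00"],
    name="beginpoints",
))

# Lookup Series: interval start -> interval number (0, 1, 2).
lookup = pd.Series(range(len(beginpoints)), index=pd.DatetimeIndex(beginpoints))

# Rows that are not interval starts map to NaN; forward-filling assigns
# them to the preceding interval (this relies on Datetime being sorted).
df["point"] = df["Datetime"].map(lookup).ffill().astype(int)

# Per interval: start time, total volume, opening price.
result = df.groupby("point").agg(
    Datetime=("Datetime", "first"),
    Volume=("Volume", "sum"),
    Price=("Price", "first"),
)
print(result)
```

The astype(int) call assumes the first row of df is itself an interval start; otherwise the leading rows stay NaN after ffill and the cast would fail.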

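There are duplicate datetimes in df, such as 22:08:00. Are they valid?
You are absolutely right, I made a silly mistake by making up the dates too quickly. This is now corrected. Thanks for pointing it out.
Thanks, this is a clever solution that combines reset_index, set_index and map. Works great!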