Python: Applying cumxxx (sum, min, ...) to windows of different sizes in a DataFrame


python pandas


I want to apply a cumxxx operation to windows of different sizes in a DataFrame. Consider the following input:

import pandas as pd
from random import seed, randint
from collections import OrderedDict

p5h = pd.period_range(start='2020-02-01 00:00', end='2020-02-04 00:00', freq='5h', name='p5h')
p1h = pd.period_range(start='2020-02-01 00:00', end='2020-02-04 00:00', freq='1h', name='p1h')

seed(1)
values = [randint(0,10) for p in p1h]
df = pd.DataFrame({'Values' : values}, index=p1h)

p5h_st_as_series = p5h.start_time.to_series()

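# For each hourly row, look up the 5-hour period with the latest start time
# that is not after the row's own start time.
df['OpeneningPeriod'] = df.apply(
    lambda x: p5h.to_series().loc[p5h_st_as_series.index <=
                                  x.name.start_time].index[-1],
    axis=1)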

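Here, cumxxx is to be applied within each of the 5-hour periods defined above. The windows can have different lengths, because a window may be a one-day period (some days include a DST change) or a one-month period (a month does not contain a fixed number of hours).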

The result I am looking for is:

df_result.head(11)
                  Values   OpeneningPeriod   Cumsum
p1h                                       
2020-02-01 00:00       2  2020-02-01 00:00        2  <- cumsum starts with a new period
2020-02-01 01:00       9  2020-02-01 00:00       11
2020-02-01 02:00       1  2020-02-01 00:00       12
2020-02-01 03:00       4  2020-02-01 00:00       16
2020-02-01 04:00       1  2020-02-01 00:00       17
2020-02-01 05:00       7  2020-02-01 05:00        7  <- cumsum starts with a new period
2020-02-01 06:00       7  2020-02-01 05:00       14
2020-02-01 07:00       7  2020-02-01 05:00       21
2020-02-01 08:00      10  2020-02-01 05:00       31
2020-02-01 09:00       6  2020-02-01 05:00       37
2020-02-01 10:00       3  2020-02-01 10:00        3  <- cumsum starts with a new period

If you need to group by 5H windows on the DatetimeIndex, use resample with cumsum:
df['Cumsum'] = df.resample('5H')['Values'].cumsum()

Alternatively, groupby should be a good starting point:
df['Cumsum'] = df.groupby('OpeneningPeriod')['Values'].cumsum()

It gives:
                  Values  OpeneningPeriod  Cumsum
p1h                                              
2020-02-01 00:00       2 2020-02-01 00:00       2
2020-02-01 01:00       9 2020-02-01 00:00      11
2020-02-01 02:00       1 2020-02-01 00:00      12
2020-02-01 03:00       4 2020-02-01 00:00      16
2020-02-01 04:00       1 2020-02-01 00:00      17
2020-02-01 05:00       7 2020-02-01 05:00       7
2020-02-01 06:00       7 2020-02-01 05:00      14
2020-02-01 07:00       7 2020-02-01 05:00      21
2020-02-01 08:00      10 2020-02-01 05:00      31
2020-02-01 09:00       6 2020-02-01 05:00      37
2020-02-01 10:00       3 2020-02-01 10:00       3
2020-02-01 11:00       1 2020-02-01 10:00       4
2020-02-01 12:00       7 2020-02-01 10:00      11
2020-02-01 13:00       0 2020-02-01 10:00      11
2020-02-01 14:00       6 2020-02-01 10:00      17
2020-02-01 15:00       6 2020-02-01 15:00       6
...
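
The groupby approach only needs a key that identifies the window each row belongs to, so it also covers windows of unequal length and other cumulative operations such as cummin. Below is a minimal sketch, not the original poster's code: the sample values, the irregular window starts, and the column name WindowStart are all hypothetical. It builds the key with a vectorized np.searchsorted lookup instead of a row-wise apply.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly sample data (not the data from the question).
idx = pd.date_range("2020-02-01 00:00", periods=12, freq="h")
df = pd.DataFrame({"Values": range(1, 13)}, index=idx)

# Hypothetical window starts of unequal length: 3 h, 5 h, then 4 h.
starts = pd.DatetimeIndex(
    ["2020-02-01 00:00", "2020-02-01 03:00", "2020-02-01 08:00"]
)

# For every row, find the position of the last window start that is
# not after the row's timestamp (starts must be sorted).
pos = np.searchsorted(starts.values, df.index.values, side="right") - 1
df["WindowStart"] = starts[pos]

# Any cumulative operation restarts at each new window.
grouped = df.groupby("WindowStart")["Values"]
df["Cumsum"] = grouped.cumsum()
df["Cummin"] = grouped.cummin()

print(df)
```

The lookup assumes that every row falls at or after the first window start; a row before it would get position -1 and be assigned to the last window.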

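Thanks @jezrael, I kept your first solution, the one with resample. Do you know whether resample can also keep the first value while leaving the index unchanged (i.e. copy the first value to the following 4 rows)? If I apply first(), the index is resampled at '5H' frequency, so the intermediate rows are not kept.

@pierre_j - I think you need df['first'] = df.resample('5H')['Values'].transform('first')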
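
The idea in the comment above, broadcasting the first value of each window back onto every row, can also be written with groupby and transform. This is a hedged sketch under stated assumptions: the column names Window and First are hypothetical, the window key assumes fixed 5-hour windows counted from the first timestamp, and the seven values are simply the first seven values shown in the question.

```python
import pandas as pd

# First seven hourly values from the question, used as sample data.
idx = pd.date_range("2020-02-01 00:00", periods=7, freq="h")
df = pd.DataFrame({"Values": [2, 9, 1, 4, 1, 7, 7]}, index=idx)

# Hypothetical window key: whole hours elapsed since the first row,
# integer-divided by the window length of 5 hours.
elapsed_hours = (df.index - df.index[0]) // pd.Timedelta(hours=1)
df["Window"] = elapsed_hours // 5

# transform returns a result aligned with the original index, so the
# hourly rows are kept and each one receives its window's first value.
df["First"] = df.groupby("Window")["Values"].transform("first")

print(df)
```

Because transform preserves the original index, no intermediate rows are lost, unlike calling first() directly on the grouped object.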