Python: group by two columns and accumulate over a 6-month look-back window on a date

python python-3.x pandas

Original dataset:

userId     createDate                  grade
0          2016-05-08 22:00:49.673     2
0          2016-07-23 12:37:11.570     7
0          2017-01-03 12:05:33.060     7
1009       2016-06-27 09:28:19.677     5
1009       2016-07-23 12:37:11.570     8
1009       2017-01-03 12:05:33.060     9
1009       2017-02-08 16:17:17.547     4
2011       2016-11-03 14:30:25.390     6
2011       2016-12-15 21:06:14.730     11
2011       2017-01-04 20:22:31.423     2
2011       2017-08-08 16:17:17.547     7

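For each user I want a 6-month look-back window ending at createDate, i.e. for every row, the sum of all of that user's grades that fall within the 6 months up to and including that row's createDate. Expected: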

userId     createDate                 
    0          2016-05-08 22:00:49.673     2
               2016-07-23 12:37:11.570     9
               2017-01-03 12:05:33.060     14
    1009       2016-06-27 09:28:19.677     5
               2016-07-23 12:37:11.570     13
               2017-01-03 12:05:33.060     17
               2017-02-08 16:17:17.547     13
    2011       2016-11-03 14:30:25.390     6
               2016-12-15 21:06:14.730     17
               2017-01-04 20:22:31.423     19
               2017-08-08 16:17:17.547     7

My current attempt is incorrect:

df.groupby(['userId','createDate'])['grade'].mean().groupby([pd.Grouper(level='userId'),pd.TimeGrouper('6M', level='createDate', closed = 'left')]).cumsum()

It gives me the following result:

userId  createDate             
0       2016-05-08 22:00:49.673     2
        2016-07-23 12:37:11.570     9
        2017-01-03 12:05:33.060     7
1009    2016-06-27 09:28:19.677     5
        2016-07-23 12:37:11.570    13
        2017-01-03 12:05:33.060     9
        2017-02-08 16:17:17.547    13
2011    2016-11-03 14:30:25.390     6
        2016-12-15 21:06:14.730    17
        2017-01-04 20:22:31.423    19
        2017-08-08 16:17:17.547     7

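Use groupby with a rolling sum inside apply, with an offset of 180D rather than 6 months, because the number of days in a month varies from month to month and a rolling window must have a constant length: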

df.groupby(['userId'])[['createDate','grade']].apply(lambda x : x.set_index('createDate').rolling('180D').sum())

                                grade
userId createDate                    
0      2016-05-08 22:00:49.673    2.0
       2016-07-23 12:37:11.570    9.0
       2017-01-03 12:05:33.060   14.0
1009   2016-06-27 09:28:19.677    5.0
       2016-07-23 12:37:11.570   13.0
       2017-01-03 12:05:33.060   17.0
       2017-02-08 16:17:17.547   13.0
2011   2016-11-03 14:30:25.390    6.0
       2016-12-15 21:06:14.730   17.0
       2017-01-04 20:22:31.423   19.0
       2017-08-08 16:17:17.547    7.0
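To make the result above reproducible, here is a minimal self-contained sketch. It rebuilds the sample data from the question and applies the same per-user 180-day rolling sum. The variable names `df` and `result` are only illustrative, and the columns are selected with a list, which newer pandas versions require.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "userId": [0, 0, 0, 1009, 1009, 1009, 1009, 2011, 2011, 2011, 2011],
    "createDate": pd.to_datetime([
        "2016-05-08 22:00:49.673", "2016-07-23 12:37:11.570",
        "2017-01-03 12:05:33.060", "2016-06-27 09:28:19.677",
        "2016-07-23 12:37:11.570", "2017-01-03 12:05:33.060",
        "2017-02-08 16:17:17.547", "2016-11-03 14:30:25.390",
        "2016-12-15 21:06:14.730", "2017-01-04 20:22:31.423",
        "2017-08-08 16:17:17.547",
    ]),
    "grade": [2, 7, 7, 5, 8, 9, 4, 6, 11, 2, 7],
})

# Per user: index the rows by date, then sum grades over the trailing 180 days.
result = (
    df.groupby("userId")[["createDate", "grade"]]
      .apply(lambda g: g.set_index("createDate").rolling("180D").sum())
)
print(result)
```

Note that 180 days is an approximation of 6 calendar months, so rows very close to the window boundary could be treated differently than under a strict calendar-month rule.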

Edit in response to a comment:

To look back 6 months, the dates need to be sorted, so you may need to sort the values first:

df.groupby(['userId'])[['createDate','grade']].apply(
    lambda x: x.sort_values('createDate').set_index('createDate').rolling('180D').sum())
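The need for sorting can be illustrated with a small sketch. The frame `unsorted` below is a hypothetical single-user example whose rows are deliberately out of date order. A time-based rolling window on such an index is expected to raise a ValueError, and sorting first avoids it.

```python
import pandas as pd

# Hypothetical single-user frame with rows deliberately out of date order.
unsorted = pd.DataFrame({
    "createDate": pd.to_datetime(["2016-07-23", "2017-01-03", "2016-05-08"]),
    "grade": [7, 7, 2],
})

# A time-based window needs a monotonic datetime index, so this should fail.
try:
    unsorted.set_index("createDate").rolling("180D").sum()
    raised = False
except ValueError as err:
    raised = True
    print("rolling failed:", err)

# Sorting by date first makes the index monotonic, and the window works.
fixed = (
    unsorted.sort_values("createDate")
            .set_index("createDate")
            .rolling("180D")
            .sum()
)
print(fixed)
```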

Edit based on @coldspeed's comment:

Using apply is overkill here. Set the index first, then use a rolling sum:

df.set_index('createDate').groupby('userId').grade.rolling('180D').sum()
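As a sketch of how this apply-free version might be used end to end, the example below takes a subset of the question's data (users 0 and 1009, with rows shuffled on purpose), sorts it, and flattens the result back into ordinary columns. The names `rolled`, `out` and `grade_180d` are assumptions made for illustration.

```python
import pandas as pd

# Subset of the question's data, with rows intentionally out of order.
df = pd.DataFrame({
    "userId": [1009, 0, 1009, 0, 1009, 0, 1009],
    "createDate": pd.to_datetime([
        "2017-02-08 16:17:17.547", "2016-07-23 12:37:11.570",
        "2016-06-27 09:28:19.677", "2017-01-03 12:05:33.060",
        "2017-01-03 12:05:33.060", "2016-05-08 22:00:49.673",
        "2016-07-23 12:37:11.570",
    ]),
    "grade": [4, 7, 5, 7, 9, 2, 8],
})

# Sort so that dates are increasing within each user, then roll per user.
rolled = (
    df.sort_values(["userId", "createDate"])
      .set_index("createDate")
      .groupby("userId")["grade"]
      .rolling("180D")
      .sum()
)

# rolled is a Series indexed by (userId, createDate); flatten it to columns.
out = rolled.rename("grade_180d").reset_index()
print(out)
```

Sorting by both columns up front keeps the dates increasing within each user, which is what the time-based window requires.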

Timings:

df = pd.concat([df]*1000)

%%timeit
df.set_index('createDate').groupby('userId').grade.rolling('180D').sum() 
100 loops, best of 3: 7.55 ms per loop

%%timeit
df.groupby(['userId'])[['createDate','grade']].apply(lambda x : x.sort_values('createDate').set_index('createDate').rolling('180D').sum())
10 loops, best of 3: 19.5 ms per loop

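Thank you for the answer. When I apply it to the full dataset I get this error: ValueError: index must be monotonic.

Are the dates sorted? To look back 6 months they need to be sorted. If they are not sorted, it raises "index must be monotonic".

Absolutely! Thank you very much.

My dataset is large and this took a long time on an r3.8xlarge instance, but I have marked it as the correct answer and upvoted it. Is there a faster way? It is still running on 14 million rows.

@dsl1990 try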

df.set_index('createDate').groupby('userId').grade.rolling('180D').sum()