Python: separate grouped cumulative sums over repeated groups
I have the following DataFrame:
Title start_time Duration Match
0 Item#1 2019-12-13 00:00:00.000 819.01 True
2 Item#1 2019-12-13 00:13:39.010 1205.25 True
4 Item#1 2019-12-13 00:33:44.260 972.80 True
6 Item#1 2019-12-13 00:49:57.060 602.23 False
9 Item#2 2019-12-13 00:59:59.290 1800.00 False
14 Item#2 2019-12-13 01:29:59.290 533.79 True
17 Item#2 2019-12-13 01:38:53.080 537.11 True
20 Item#2 2019-12-13 01:47:50.190 729.10 False
24 Item#3 2019-12-13 01:59:59.290 726.97 True
26 Item#3 2019-12-13 02:12:06.260 569.01 True
28 Item#3 2019-12-13 02:21:35.270 504.02 False
32 Item#4 2019-12-13 02:29:59.290 1800.00 False
36 Item#1 2019-12-13 02:59:59.290 776.98 True
38 Item#1 2019-12-13 03:12:56.270 1045.81 True
40 Item#1 2019-12-13 03:30:22.080 988.20 True
43 Item#1 2019-12-13 03:46:50.280 789.01 False
I want to run a cumulative sum on the Duration column. So far I have used the following line of code:
df.groupby(['Title'])['Duration'].cumsum()
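However, I don't want rows with the same Title to be grouped together when they are separated in time. In the example above, the two runs of Item#1 should be treated as two separate groups rather than one. How can I do this?

I think you need to group by consecutive runs of equal values, which means Item#1 is processed as two groups: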
g = df['Title'].ne(df['Title'].shift()).cumsum()
df['new'] = df.groupby(g)['Duration'].cumsum()
print(df)
Title start_time Duration Match new
0 Item#1 2019-12-13 00:00:00.000 819.01 True 819.01
2 Item#1 2019-12-13 00:13:39.010 1205.25 True 2024.26
4 Item#1 2019-12-13 00:33:44.260 972.80 True 2997.06
6 Item#1 2019-12-13 00:49:57.060 602.23 False 3599.29
9 Item#2 2019-12-13 00:59:59.290 1800.00 False 1800.00
14 Item#2 2019-12-13 01:29:59.290 533.79 True 2333.79
17 Item#2 2019-12-13 01:38:53.080 537.11 True 2870.90
20 Item#2 2019-12-13 01:47:50.190 729.10 False 3600.00
24 Item#3 2019-12-13 01:59:59.290 726.97 True 726.97
26 Item#3 2019-12-13 02:12:06.260 569.01 True 1295.98
28 Item#3 2019-12-13 02:21:35.270 504.02 False 1800.00
32 Item#4 2019-12-13 02:29:59.290 1800.00 False 1800.00
36 Item#1 2019-12-13 02:59:59.290 776.98 True 776.98
38 Item#1 2019-12-13 03:12:56.270 1045.81 True 1822.79
40 Item#1 2019-12-13 03:30:22.080 988.20 True 2810.99
43 Item#1 2019-12-13 03:46:50.280 789.01 False 3600.00
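For anyone who wants to try this without the full data set, here is a minimal self-contained sketch of the same approach. The six-row frame and its Duration values are made up for illustration, and the names run_id, cum_duration and cum_by_title are not from the original post. It also shows the plain groupby on Title for comparison, which merges both runs of Item#1.

```python
import pandas as pd

# Shortened, made-up stand-in for the data in the question.
df = pd.DataFrame({
    "Title": ["Item#1", "Item#1", "Item#2", "Item#2", "Item#1", "Item#1"],
    "Duration": [10.0, 20.0, 5.0, 7.0, 1.0, 2.0],
})

# A new run starts wherever Title differs from the previous row,
# so the running count of those change points labels each run.
run_id = df["Title"].ne(df["Title"].shift()).cumsum()

# The cumulative sum restarts at the beginning of every run.
df["cum_duration"] = df.groupby(run_id)["Duration"].cumsum()

# For comparison: grouping by Title alone carries the total of the
# first Item#1 run over into the second one.
df["cum_by_title"] = df.groupby("Title")["Duration"].cumsum()

print(df)
```

In this sketch the second run of Item#1 starts again from its own first Duration in cum_duration, whereas cum_by_title keeps adding to the total of the first run.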
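Details:
Compare the Title column with its shifted values (Series.shift) for inequality using Series.ne, then take the cumulative sum (Series.cumsum) of the result to get a label for each consecutive group: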
print(df[['Title']].assign(shifted=df['Title'].shift(),
                           not_equal=df['Title'].ne(df['Title'].shift()),
                           g=df['Title'].ne(df['Title'].shift()).cumsum()))
Title shifted not_equal g
0 Item#1 NaN True 1
2 Item#1 Item#1 False 1
4 Item#1 Item#1 False 1
6 Item#1 Item#1 False 1
9 Item#2 Item#1 True 2
14 Item#2 Item#2 False 2
17 Item#2 Item#2 False 2
20 Item#2 Item#2 False 2
24 Item#3 Item#2 True 3
26 Item#3 Item#3 False 3
28 Item#3 Item#3 False 3
32 Item#4 Item#3 True 4
36 Item#1 Item#4 True 5
38 Item#1 Item#1 False 5
40 Item#1 Item#1 False 5
43 Item#1 Item#1 False 5
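Could you explain this in more detail?
@SimonBreton - I'm not sure I understood the question, but I added some explanation and details.
Yes, that sounds good. Could you explain .ne and shift?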
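On the last comment, a small sketch may help. Series.shift() moves every value down by one row, so each row sees the previous row's value and the first row gets a missing value. Series.ne(other) is the method form of !=, compared element by element. The four-element Series and the variable names below are made up for illustration.

```python
import pandas as pd

s = pd.Series(["a", "a", "b", "a"])

# Each row now holds the value of the row before it; the first row is missing.
shifted = s.shift()

# Element-wise "not equal", the same as s != shifted.
# True marks a row whose value differs from the one before it.
changed = s.ne(shifted)

# True counts as 1, so the running total goes up by one at every change
# and stays constant inside a run of equal values.
group_id = changed.cumsum()

print(pd.DataFrame({"s": s, "shifted": shifted,
                    "changed": changed, "group_id": group_id}))
```

The first row is always marked as changed, because a real value never compares equal to the missing value that shift() puts there. That is why the group labels start at 1.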