Pandas groupby and cumsum with multiple conditions and columns - Python
I have the following DataFrame:
import pandas as pd
hits = {'id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
'datetime': ['2010-01-02 03:00:00','2010-01-02 03:00:14','2010-01-02 03:00:35','2010-01-02 03:00:38',
'2010-01-02 03:29:10','2010-01-02 03:29:35','2010-01-02 03:45:20','2010-01-02 06:10:05',
'2010-01-02 06:10:15','2010-01-02 07:40:15','2010-01-02 07:40:20','2010-01-02 07:40:25'],
'uri_len': [10,20,25,15,20,10,20,25,15,30,40,45]
}
df = pd.DataFrame(hits, columns = ['id', 'datetime','uri_len'])
df['datetime'] = pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S')
print (df)
   id            datetime  uri_len
0   A 2010-01-02 03:00:00       10
1   A 2010-01-02 03:00:14       20
2   A 2010-01-02 03:00:35       25
3   A 2010-01-02 03:00:38       15
4   A 2010-01-02 03:29:10       20
5   A 2010-01-02 03:29:35       10
6   B 2010-01-02 03:45:20       20
7   B 2010-01-02 06:10:05       25
8   B 2010-01-02 06:10:15       15
9   C 2010-01-02 07:40:15       30
10  C 2010-01-02 07:40:20       40
11  C 2010-01-02 07:40:25       45
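I would like to group the hits into sessions, using id as the grouping variable. For me, a new session starts after a period of inactivity longer than 15 seconds (computed from the datetime column) or when the uri_len column decreases; in both cases consecutive hits are compared.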
I know how to group by each condition separately:
df['session1'] = (df.groupby('id')['datetime']
.transform(lambda x: x.diff().gt('15Sec').cumsum())
)
df['session2'] = (df.groupby('id')['uri_len']
.transform(lambda x: x.diff().lt(0).cumsum())
)
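Both transformations rely on the same idiom: diff() compares each hit with the previous one, the comparison marks the hits that start a new session, and cumsum() turns those marks into a running session number. A minimal sketch on made-up values (the Series below is hypothetical and not part of the data above):

```python
import pandas as pd

# Hypothetical toy values, used only to illustrate the idiom.
s = pd.Series([10, 20, 15, 30, 5])

step = s.diff()            # change from the previous value; NaN on the first row
is_break = step.lt(0)      # True where the value decreased; NaN compares as False
label = is_break.cumsum()  # running count of breaks so far

print(label.tolist())      # [0, 0, 1, 1, 2]
```

Inside groupby('id').transform(...), the same three steps run separately for each id, so the numbering restarts at 0 for every group.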
Is there a way to combine the two transformations in a single line, so that the following output is produced directly?
   id            datetime  uri_len  session
0   A 2010-01-02 03:00:00       10        0
1   A 2010-01-02 03:00:14       20        0
2   A 2010-01-02 03:00:35       25        1
3   A 2010-01-02 03:00:38       15        2
4   A 2010-01-02 03:29:10       20        3
5   A 2010-01-02 03:29:35       10        4
6   B 2010-01-02 03:45:20       20        0
7   B 2010-01-02 06:10:05       25        1
8   B 2010-01-02 06:10:15       15        2
9   C 2010-01-02 07:40:15       30        0
10  C 2010-01-02 07:40:20       40        0
11  C 2010-01-02 07:40:25       45        0
If I understand correctly, you want to add them together?
df['session'] = df.groupby('id')['datetime'].transform(lambda x:
x.diff().gt('15Sec').cumsum()) + df.groupby('id')['uri_len'].transform(lambda x:
x.diff().lt(0).cumsum())
A more explicit way:
s1 = df.groupby('id')['datetime'].transform(lambda x:
x.diff().gt('15Sec').cumsum())
s2 = df.groupby('id')['uri_len'].transform(lambda x: x.diff().lt(0).cumsum())
df['session'] = s1+s2
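One caveat with adding the two counters: a hit that meets both conditions at once is counted twice. In the sample data this happens at row 5 (a 25-second gap and a drop in uri_len from 20 to 10), so the sum gives 5 there, while the desired output shows 4. A possible alternative, sketched here under the assumption that a hit should open exactly one new session when either condition holds, is to combine the two boolean conditions with | before taking the cumulative sum. The names gap, delta and new_session are illustrative, and pd.Timedelta(seconds=15) is used in place of the '15Sec' string:

```python
import pandas as pd

hits = {'id': ['A','A','A','A','A','A','B','B','B','C','C','C'],
        'datetime': ['2010-01-02 03:00:00','2010-01-02 03:00:14','2010-01-02 03:00:35','2010-01-02 03:00:38',
                     '2010-01-02 03:29:10','2010-01-02 03:29:35','2010-01-02 03:45:20','2010-01-02 06:10:05',
                     '2010-01-02 06:10:15','2010-01-02 07:40:15','2010-01-02 07:40:20','2010-01-02 07:40:25'],
        'uri_len': [10,20,25,15,20,10,20,25,15,30,40,45]}
df = pd.DataFrame(hits)
df['datetime'] = pd.to_datetime(df['datetime'])

# Difference from the previous hit of the same id (NaT/NaN on each group's first row).
gap = df['datetime'] - df.groupby('id')['datetime'].shift()
delta = df['uri_len'] - df.groupby('id')['uri_len'].shift()

# A hit opens a new session if EITHER condition holds; NaT/NaN compare as False.
new_session = gap.gt(pd.Timedelta(seconds=15)) | delta.lt(0)

# Running count of session starts, restarted for every id.
df['session'] = new_session.astype(int).groupby(df['id']).cumsum()
print(df)
```

On the sample data this yields the session column 0, 0, 1, 2, 3, 4 for A, 0, 1, 2 for B and 0, 0, 0 for C.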