Pandas - expanding average session time


The following DataFrame represents events received from users, with the user id and the timestamp of each event:

    id           timestamp
0    1 2020-09-01 18:14:35
1    1 2020-09-01 18:14:39
2    1 2020-09-01 18:14:40
3    1 2020-09-01 02:09:22
4    1 2020-09-01 02:09:35
5    1 2020-09-01 02:09:53
6    1 2020-09-01 02:09:57
7    2 2020-09-01 18:14:35
8    2 2020-09-01 18:14:39
9    2 2020-09-01 18:14:40
10   2 2020-09-01 02:09:22
11   2 2020-09-01 02:09:35
12   2 2020-09-01 02:09:53
13   2 2020-09-01 02:09:57

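I would like to get the expanding average session time. A session is a sequence of events that is separated from the next session by a break of more than 5 minutes.
I grouped the sessions as follows: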

df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])
This gave the correct groups:

    id           timestamp
3    1 2020-09-01 02:09:22
4    1 2020-09-01 02:09:35
5    1 2020-09-01 02:09:53
6    1 2020-09-01 02:09:57

    id           timestamp
0    1 2020-09-01 18:14:35
1    1 2020-09-01 18:14:39
2    1 2020-09-01 18:14:40

    id           timestamp
10   2 2020-09-01 02:09:22
11   2 2020-09-01 02:09:35
12   2 2020-09-01 02:09:53
13   2 2020-09-01 02:09:57

    id           timestamp
7    2 2020-09-01 18:14:35
8    2 2020-09-01 18:14:39
9    2 2020-09-01 18:14:40

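One caveat about the grouping above: `pd.Grouper` with `freq='5min'` assigns events to fixed 5-minute windows counted from the origin. It does not look at the gap between consecutive events, so a session that straddles a window boundary would be split in two. For the sample data both approaches give the same groups. A gap-based alternative might look like the sketch below; the column name `session` and the choice to sort chronologically first are assumptions, not part of the question.

```python
import pandas as pd

# Sample events from the question, in the original row order.
ts = ["2020-09-01 18:14:35", "2020-09-01 18:14:39", "2020-09-01 18:14:40",
      "2020-09-01 02:09:22", "2020-09-01 02:09:35", "2020-09-01 02:09:53",
      "2020-09-01 02:09:57"]
df = pd.DataFrame({"id": [1] * 7 + [2] * 7, "timestamp": pd.to_datetime(ts * 2)})

# Assumption: events are processed in chronological order per user.
df = df.sort_values(["id", "timestamp"])

# Gap to the previous event of the same user (NaT for the user's first event).
gap = df.groupby("id")["timestamp"].diff()

# A session starts at a user's first event or after a gap of more than 5 minutes.
new_session = gap.isna() | (gap > pd.Timedelta(minutes=5))

# Running count of session starts per user gives a session number (1, 2, ...).
df["session"] = new_session.astype(int).groupby(df["id"]).cumsum()

print(df.sort_index())
```

Because of the chronological sort, the 02:09 events become session 1 and the 18:14 events session 2 for each user, which is the reverse of the row order used in the question.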
Now I want to calculate, at any given row, the average session time of each user up to that row (in seconds), so the output would be:

    id           timestamp  avg_session_time
0    1 2020-09-01 18:14:35                 0  <-- first event
1    1 2020-09-01 18:14:39                 4  <-- 2nd event after 4 seconds
2    1 2020-09-01 18:14:40                 5  <-- 3rd event after 5 seconds
--- session end
3    1 2020-09-01 02:09:22                 5  <-- first event of second session
4    1 2020-09-01 02:09:35                 9  <-- 2nd event after 13 seconds (13 seconds in the 2nd session + 5 in the first session, divided by the number of sessions, 2)
5    1 2020-09-01 02:09:53                18  <-- 3rd event after 31 seconds ((31 + 5) / 2 = 18)
6    1 2020-09-01 02:09:57                20  <-- 4th event after 35 seconds ((35 + 5) / 2 = 20)
---
7    2 2020-09-01 18:14:35                 0
8    2 2020-09-01 18:14:39                 4
9    2 2020-09-01 18:14:40                 5
---
10   2 2020-09-01 02:09:22                 5
11   2 2020-09-01 02:09:35                 9
12   2 2020-09-01 02:09:53                18
13   2 2020-09-01 02:09:57                20

Use:

import numpy as np
import pandas as pd

# converting to datetimes
df['timestamp'] = pd.to_datetime(df['timestamp'])
# grouping per 5 minutes and id
g = df.groupby(['id', pd.Grouper(key="timestamp", freq='5min', origin='start')])
# get first values per group into a new column
df['diff'] = g['timestamp'].transform('first')
# subtract from timestamp and convert timedeltas to seconds
df['diff'] = df['timestamp'].sub(df['diff']).dt.total_seconds()
# shifting per groups by id
df['new'] = df.groupby('id')['diff'].shift()
# get first value per groups, now shifted
df['new'] = g['new'].transform('first')
# replace 0 with missing values and get average
df['last'] = df[['new','diff']].replace(0, np.nan).mean(axis=1).fillna(df['new'])
print (df)

    id           timestamp  diff  new  last
0    1 2020-09-01 18:14:35   0.0  0.0   0.0
1    1 2020-09-01 18:14:39   4.0  0.0   4.0
2    1 2020-09-01 18:14:40   5.0  0.0   5.0
3    1 2020-09-01 02:09:22   0.0  5.0   5.0
4    1 2020-09-01 02:09:35  13.0  5.0   9.0
5    1 2020-09-01 02:09:53  31.0  5.0  18.0
6    1 2020-09-01 02:09:57  35.0  5.0  20.0
7    2 2020-09-01 18:14:35   0.0  0.0   0.0
8    2 2020-09-01 18:14:39   4.0  0.0   4.0
9    2 2020-09-01 18:14:40   5.0  0.0   5.0
10   2 2020-09-01 02:09:22   0.0  5.0   5.0
11   2 2020-09-01 02:09:35  13.0  5.0   9.0
12   2 2020-09-01 02:09:53  31.0  5.0  18.0
13   2 2020-09-01 02:09:57  35.0  5.0  20.0

Comment: Thank you very much! You are the king :)
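The answer's code averages the current session only with the session immediately before it, which matches the two-session sample. With three or more sessions per user, a running total over all earlier sessions would be needed. The sketch below is one possible generalization, not the answerer's method. It rests on assumptions inferred from the expected output: sessions are taken in row order, a new session starts when the timestamp jumps by more than 5 minutes in either direction, and the current session is counted only once its elapsed time is above zero. The column names `session`, `elapsed` and `avg_session_time` are illustrative.

```python
import pandas as pd

# Sample events from the question, in the original row order.
ts = ["2020-09-01 18:14:35", "2020-09-01 18:14:39", "2020-09-01 18:14:40",
      "2020-09-01 02:09:22", "2020-09-01 02:09:35", "2020-09-01 02:09:53",
      "2020-09-01 02:09:57"]
df = pd.DataFrame({"id": [1] * 7 + [2] * 7, "timestamp": pd.to_datetime(ts * 2)})

# Assumption: a session starts at a user's first row, or when the timestamp
# differs from the previous row by more than 5 minutes in either direction
# (the sample rows are not in chronological order).
gap = df.groupby("id")["timestamp"].diff().abs()
start = gap.isna() | (gap > pd.Timedelta(minutes=5))
df["session"] = start.astype(int).groupby(df["id"]).cumsum()

# Seconds elapsed since the first event of the current session.
first = df.groupby(["id", "session"])["timestamp"].transform("first")
df["elapsed"] = (df["timestamp"] - first).dt.total_seconds()

# Duration of each session, placed on its last row and 0 elsewhere.
is_last = ~df.duplicated(["id", "session"], keep="last")
closed = df["elapsed"].where(is_last, 0.0)

# Summed duration of all earlier (completed) sessions of the same user.
done = closed.groupby(df["id"]).cumsum() - closed

# Sessions counted so far: all earlier ones, plus the current one once it
# has a non-zero elapsed time (assumption taken from the expected output).
n = (df["session"] - 1) + (df["elapsed"] > 0).astype(int)

# Average over the counted sessions; 0 when nothing is counted yet.
df["avg_session_time"] = ((done + df["elapsed"]) / n.where(n > 0)).fillna(0.0)

print(df[["id", "timestamp", "avg_session_time"]])
```

For the sample data this reproduces the `avg_session_time` column requested in the question. Unlike the answer's `replace(0, np.nan)` step, an earlier session consisting of a single event (duration 0) is still counted here; whether that is desirable depends on how such sessions should be treated.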