Python datetime: aggregation based on time lag
I have a 3-column dataframe containing id, timestamp and event type:
   id  timestamp   event_type
_____________________________
0  1   2019-10-01  E1
1  1   2019-10-03  E3
2  2   2019-10-04  E3
3  2   2019-10-05  E4
4  2   2019-10-06  E1
5  1   2019-10-07  E3
6  1   2019-10-07  E4
7  1   2019-10-13  E3
8  2   2019-10-22  E5
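I am looking for a way to aggregate it so that event types that belong to the same id and lie at most X days apart (here X = 3) are collected into one list.
That is, rows 0 and 1 should produce one list, because their timestamps are no more than 3 days apart.
So my expected output is as follows: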
   id2  event_hist
__________________
0  1-1  [E1, E3]
1  2-1  [E3, E4, E1]
2  1-2  [E3, E4]
3  1-3  [E3]
4  2-2  [E5]
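The id2 column is just the id from the first dataframe with a counter appended that increments for each new sequence.
I could write a function to achieve the desired result, but is there a built-in way?
What is the most Pythonic way to get the desired output?

If the timestamp column is not already datetime, start by converting it: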
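import pandas as pd

df['timestamp'] = pd.to_datetime(df['timestamp'])
# Per id, count the rows that fall inside a trailing 3-day window
t = df.groupby('id').apply(lambda g: g.rolling('3d', on='timestamp').count())
# A count that does not grow compared with the previous row marks the start of a new sequence
new = df.groupby(t['id'].le(t.shift()['id']).cumsum()) \
    .agg(event_hist=('event_type', list), id2=('id', 'first'))
# Append a running number per id: 1-1, 1-2, ...
new['id2'] = new['id2'].astype(str) + \
    '-' + \
    new.groupby('id2').cumcount().add(1).astype(str)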
This results in:
      event_hist  id2
id
0       [E1, E3]  1-1
1   [E3, E4, E1]  2-1
2       [E3, E4]  1-2
3           [E3]  1-3
4           [E5]  2-2
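One thing to keep in mind with the rolling-window version is that it counts rows inside a trailing window instead of measuring the gap between consecutive events of one id. As far as I know, a time-based rolling window in pandas is closed only on the right by default, so an event exactly 3 days earlier falls outside a '3d' window. Below is a minimal sketch of an alternative that works on the per-id gaps directly. It assumes the rows can be put in chronological order and that "at most 3 days" is meant inclusively. The names aggregate_by_gap and max_lag_days are made up for this illustration and are not part of pandas or of the answer above.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "id": [1, 1, 2, 2, 2, 1, 1, 1, 2],
    "timestamp": pd.to_datetime([
        "2019-10-01", "2019-10-03", "2019-10-04", "2019-10-05", "2019-10-06",
        "2019-10-07", "2019-10-07", "2019-10-13", "2019-10-22",
    ]),
    "event_type": ["E1", "E3", "E3", "E4", "E1", "E3", "E4", "E3", "E5"],
})


def aggregate_by_gap(df, max_lag_days=3):
    """Hypothetical helper: collect event types per id into sequences whose
    consecutive events are at most max_lag_days apart."""
    # Stable sort keeps the original order of rows that share a timestamp
    df = df.sort_values("timestamp", kind="stable")
    # Time since the previous event of the same id (NaT for the first event)
    gap = df.groupby("id")["timestamp"].diff()
    # A sequence starts at the first event of an id or after a gap that is too large
    starts_new = gap.isna() | (gap > pd.Timedelta(days=max_lag_days))
    # Running sequence number within each id: 1, 1, 2, 2, 3, ...
    seq_no = starts_new.astype(int).groupby(df["id"]).cumsum()
    id2 = (df["id"].astype(str) + "-" + seq_no.astype(str)).rename("id2")
    # sort=False keeps the sequences in order of first appearance
    return (
        df.groupby(id2, sort=False)["event_type"]
        .agg(list)
        .rename("event_hist")
        .reset_index()
    )


result = aggregate_by_gap(df)
print(result)
```

On the sample data this should give the five sequences from the expected output, in the same order. Changing `>` to `>=` would make a gap of exactly 3 days start a new sequence, if that is the intended reading.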
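
I found an answer to my own question that seems to work, although I think the answer proposed by @splash58 is more efficient: it uses fewer lines and more built-in functions.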
import pandas as pd

def get_aggregate_by_lag(df, idcol, datecol, valuecol, max_lag):
    res_dict = {}
    for id_ in df[idcol].unique():
        # Rows of this id, assumed to be in chronological order already
        sub_df = df[df[idcol] == id_].reset_index(drop=True)
        current_sequence = [sub_df[valuecol][0]]
        sequence_counter = 1
        for i in range(1, len(sub_df)):
            if (sub_df[datecol][i] - sub_df[datecol][i - 1]).days <= max_lag:
                # Close enough to the previous event: extend the current sequence
                current_sequence.append(sub_df[valuecol][i])
            else:
                # Gap too large: store the finished sequence and start a new one
                res_dict[f"{id_}-{sequence_counter}"] = [current_sequence]
                sequence_counter += 1
                current_sequence = [sub_df[valuecol][i]]
        # Store the sequence that is still open (also covers ids with a single row)
        res_dict[f"{id_}-{sequence_counter}"] = [current_sequence]
    return pd.DataFrame.from_dict(res_dict, columns=["hist"], orient="index")
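
A small caveat on the `.days <= max_lag` comparison in the function above: `.days` only looks at the whole-day part of the difference. With the date-only timestamps in the question that makes no difference, but if the column also carried times of day, a gap of 3 days and 23 hours would still pass the check. The timestamps below are made up purely to illustrate this.

```python
import pandas as pd

max_lag = 3
# Hypothetical timestamps with a time-of-day component
gap = pd.Timestamp("2019-10-04 23:00") - pd.Timestamp("2019-10-01 00:00")

# .days keeps only the whole-day component: 3 days 23 hours -> 3
passes_days_check = gap.days <= max_lag
# Comparing the full Timedelta takes the extra hours into account
passes_full_check = gap <= pd.Timedelta(days=max_lag)

print(gap.days, passes_days_check, passes_full_check)
```

Which of the two comparisons is right depends on how "no more than 3 days apart" should be read for timestamps that carry a time of day.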