Python pandas: sort within groups, then aggregate
I am doing query analysis for a search engine. Within one session, a user can issue several different queries on the Google search engine, one after another, at different times.
I have data with several fields: session_id, log_time, query, feature_i, and so on. I want to group by session_id and then collapse the rows of each group into a single row, ordered by log_time, so that the output represents each user's behaviour as a time series.
Dataset:
   session_id  log_time    query  cate_feat_0  num_feat_0
0           1         4       hi        apple           1
1           2         5     dude       banana           2
2           1         6   pandas        apple           3
3           2         1  groupby       banana           4
4           3         2     sort        apple           5
5           3         3      agg       banana           6
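The code that built the dataset is not shown in the post. A minimal sketch that reproduces the printed table, assuming the frame is called toy_data (the name used in the attempt); the values are copied from the table, not from the original code:

```python
import pandas as pd

# Reconstructed from the printed table; the original construction code is not shown.
toy_data = pd.DataFrame({
    'session_id': [1, 2, 1, 2, 3, 3],
    'log_time': [4, 5, 6, 1, 2, 3],
    'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
    'cate_feat_0': ['apple', 'banana'] * 3,
    'num_feat_0': [1, 2, 3, 4, 5, 6],
})
print(toy_data)
```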
What I want:
## note: every list is sorted by log_time within its session_id group
session_id query_list log_time_list cate_feat_0_list num_feat_0_list
1 [hi, pandas] [4,6] [apple, apple] [1,3]
2 [groupby, dude] [1,5] [banana, banana] [4,2]
3 [sort,agg] [2,3] [apple, banana] [5,6]
My attempt
First, group and aggregate with:
toy_data_res = toy_data.groupby('session_id').agg({'query': list, 'log_time': list, 'cate_feat_0': list, 'num_feat_0': list})
toy_data_res
which gives:
query log_time cate_feat_0 num_feat_0
session_id
1 [hi, pandas] [4, 6] [apple, apple] [1, 3]
2 [dude, groupby] [5, 1] [banana, banana] [2, 4]
3 [sort, agg] [2, 3] [apple, banana] [5, 6]
Then, sort within each session with:
for i in toy_data_res.index:
    sort_index = np.argsort(toy_data_res.loc[i, 'log_time'])  ## time order within the group
    for col in toy_data_res.columns.values:
        toy_data_res.loc[i, col] = [toy_data_res.loc[i, col][j] for j in sort_index]  ## reorder the values in each column
toy_data_res
which gives:
query log_time cate_feat_0 num_feat_0
session_id
1 [hi, pandas] [4, 6] [apple, apple] [1, 3]
2 [groupby, dude] [1, 5] [banana, banana] [4, 2]
3 [sort, agg] [2, 3] [apple, banana] [5, 6]
My approach is quite slow. Is there a better way to do groupby -> sort within group -> aggregation?
Any tips would be appreciated.
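Use DataFrame.sort_values before groupby. If the same function should be applied to every column, select the columns with a list of names: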
df = (toy_data.sort_values(['session_id', 'log_time'])
              .groupby('session_id')[['query', 'log_time', 'cate_feat_0', 'num_feat_0']]
              .agg(list))
print(df)
query log_time cate_feat_0 num_feat_0
session_id
1 [hi, pandas] [4, 6] [apple, apple] [1, 3]
2 [groupby, dude] [1, 5] [banana, banana] [4, 2]
3 [sort, agg] [2, 3] [apple, banana] [5, 6]
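Pre-sorting is enough because groupby keeps the rows of each group in the order they have in the frame being grouped. A small sketch that checks this on the toy data; the frame is rebuilt from the printed table, and is_time_ordered is a hypothetical helper written only for this illustration:

```python
import pandas as pd

# Toy frame rebuilt from the table in the question (values copied by hand).
toy_data = pd.DataFrame({
    'session_id': [1, 2, 1, 2, 3, 3],
    'log_time': [4, 5, 6, 1, 2, 3],
    'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
    'cate_feat_0': ['apple', 'banana'] * 3,
    'num_feat_0': [1, 2, 3, 4, 5, 6],
})
cols = ['query', 'log_time', 'cate_feat_0', 'num_feat_0']

# Without pre-sorting, each list follows the original row order.
unsorted_res = toy_data.groupby('session_id')[cols].agg(list)

# Pre-sorting by (session_id, log_time) puts the rows of every group in time order.
sorted_res = (toy_data.sort_values(['session_id', 'log_time'])
                      .groupby('session_id')[cols]
                      .agg(list))


def is_time_ordered(res):
    """Hypothetical helper: True if every log_time list is non-decreasing."""
    return all(times == sorted(times) for times in res['log_time'])


print(is_time_ordered(unsorted_res))  # False: session 2 is [5, 1]
print(is_time_ordered(sorted_res))    # True
```

Because all columns are aggregated from the same sorted rows, the lists stay aligned with each other without any per-row loop.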
Try sorting by session_id and log_time before the groupby:
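import pandas as pd

df = pd.DataFrame({'session_id': [1, 2, 1, 2, 3, 3],
                   'log_time': [4, 5, 6, 1, 2, 3],
                   'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
                   'cate_feat_0': ['apple', 'banana'] * 3,
                   'num_feat_0': [1, 2, 3, 4, 5, 6]})
# Sort first so the rows of each session are already in time order.
df = df.sort_values(by=['session_id', 'log_time'])
# Select the columns with a list (double brackets), then collect each group into lists.
grouped = df.groupby('session_id')[['log_time', 'query', 'cate_feat_0', 'num_feat_0']].agg(list)
print(grouped)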
Output:
log_time query cate_feat_0 num_feat_0
session_id
1 [4, 6] [hi, pandas] [apple, apple] [1, 3]
2 [1, 5] [groupby, dude] [banana, banana] [4, 2]
3 [2, 3] [sort, agg] [apple, banana] [5, 6]
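The result above has the right contents but not the column names from the question (query_list, log_time_list, and so on). One possible way to get that layout, sketched here on the same data, is to chain add_suffix and reset_index; the name out is only for illustration:

```python
import pandas as pd

df = pd.DataFrame({'session_id': [1, 2, 1, 2, 3, 3],
                   'log_time': [4, 5, 6, 1, 2, 3],
                   'query': ['hi', 'dude', 'pandas', 'groupby', 'sort', 'agg'],
                   'cate_feat_0': ['apple', 'banana'] * 3,
                   'num_feat_0': [1, 2, 3, 4, 5, 6]})
cols = ['query', 'log_time', 'cate_feat_0', 'num_feat_0']

out = (df.sort_values(['session_id', 'log_time'])
         .groupby('session_id')[cols]
         .agg(list)
         .add_suffix('_list')   # query -> query_list, log_time -> log_time_list, ...
         .reset_index())        # turn session_id back into an ordinary column
print(out)
```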
Thanks, bro! I was stuck before and had not considered the pre-sort approach.
Remember, if it solved your problem, you can accept the answer :)