Python: group by one column, get counts from another column


I have a dataset (below) that I would like to group by user_id, and then get a count of each cluster_label for each user_id. The aim is to understand how many times each user visited each cluster.

Basically, I am looking for the result to return something like this (it could be a list, a dict, or comma separated):

I tried the following code:

data['user_id'] = data.index
result = data.groupby(['user_id','cluster_label']).count() 

The second code block got me closer to what I am looking for, but I could not work out the count part:

790068    [[485, 256, 304, 311, 311, 311, 311, 417, 417]]
Data:

user_id,timestamp,latitude,longitude,cluster_label
822000000000000000,3/28/2017 22:31,38.7842,-77164,634
822000000000000000,3/28/2017 22:44,38.7842,-77164,634
822000000000000000,3/29/2017 8:02,38.8976805,-77387238,413
822000000000000000,3/29/2017 8:21,38.8976805,-77387238,413
822000000000000000,3/29/2017 19:58,38.8976805,-77387238,413
822000000000000000,3/29/2017 22:12,38.8976805,-77387238,413
822000000000000000,3/30/2017 9:07,38.8976805,-77387238,413
822000000000000000,3/30/2017 10:27,38.8976805,-77387238,413
822000000000000000,3/30/2017 17:17,38.8976805,-77387238,413
822000000000000000,3/30/2017 17:19,38.8976805,-77387238,413
822000000000000000,3/30/2017 17:19,38.8976805,-77387238,413
822000000000000000,3/30/2017 17:20,38.8976805,-77387238,413
822000000000000000,3/30/2017 17:22,38.8976805,-77387238,413
822000000000000000,3/30/2017 18:16,38.8976805,-77387238,413
822000000000000000,3/30/2017 18:17,38.8976805,-77387238,413
822000000000000000,3/30/2017 21:43,38.8976805,-77387238,413
822000000000000000,3/31/2017 7:04,38.8976805,-77387238,413
821000000000000000,3/9/2017 19:06,39.1328,-76.694,35
821000000000000000,3/9/2017 19:07,393426644,-76.6874899,90
821000000000000000,3/9/2017 19:07,38.93730032,-778885944,207
821000000000000000,3/9/2017 19:07,38.9071923,-77368707,327
821000000000000000,3/9/2017 19:06,38.8940974,-77276216,438
821000000000000000,3/9/2017 19:07,38.882584,-77.1124701,521
821000000000000000,3/9/2017 19:08,38.8577901,-76.8538565,565
821000000000000000,3/27/2017 21:12,38.888108,-771978416,485
820000000000000000,3/9/2017 19:09,39535541,-77.1347642,77
820000000000000000,3/9/2017 19:08,38.9847,-77.1131,143
820000000000000000,3/22/2017 14:26,38.8951,-77367,432
820000000000000000,3/24/2017 19:13,39227,-77.1864,98
820000000000000000,3/30/2017 7:39,39227,-77.1864,98
819000000000000000,3/9/2017 19:09,39942239,-76.85709,54
819000000000000000,3/9/2017 19:11,39042,-7719,109
819000000000000000,3/9/2017 19:16,38.95315,-77.447735,191
819000000000000000,3/9/2017 19:10,38.95278983,-77.44791904,191
819000000000000000,3/9/2017 19:12,38.94033497,-77.17591993,204
819000000000000000,3/9/2017 19:09,38.917866,-7723722,260
819000000000000000,3/9/2017 19:09,38.917866,-7723722,260
819000000000000000,3/9/2017 19:09,38.917866,-7723722,260
819000000000000000,3/9/2017 19:15,38.91778,-76.9769,263
819000000000000000,3/9/2017 19:12,38.916489,-77318051,264
819000000000000000,3/9/2017 19:12,38.915147,-77217751,278
819000000000000000,3/9/2017 19:15,38.912068,-77190228,290

I think you can use size for the counting and reshape with unstack, either replacing the missing values or not:

result = data.groupby(['user_id','cluster_label']).size().unstack(fill_value=0)
print (result)
cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000    0    1    0    0    0    1    0    2    1    0  ...   
820000000000000000    0    0    1    0    2    0    1    0    0    0  ...   
821000000000000000    1    0    0    1    0    0    0    0    0    1  ...   
822000000000000000    0    0    0    0    0    0    0    0    0    0  ...   

cluster_label       278  290  327  413  432  438  485  521  565  634  
user_id                                                               
819000000000000000    1    1    0    0    0    0    0    0    0    0  
820000000000000000    0    0    0    0    1    0    0    0    0    0  
821000000000000000    0    0    1    0    0    1    1    1    1    0  
822000000000000000    0    0    0   15    0    0    0    0    0    2  

[4 rows x 23 columns]
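
(Not part of the original answer, just a quick aside: individual counts can be read straight off this pivoted frame with .loc. The user id and cluster label below are taken from the sample output above and are assumed to have been parsed as integers.)

print (result.loc[822000000000000000, 413])   # prints 15, visits to cluster 413
print (result.loc[819000000000000000, 191])   # prints 2, visits to cluster 191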

Without fill_value, the missing combinations stay as NaN:

result = data.groupby(['user_id','cluster_label']).size().unstack()
print (result)

cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000  NaN  1.0  NaN  NaN  NaN  1.0  NaN  2.0  1.0  NaN  ...   
820000000000000000  NaN  NaN  1.0  NaN  2.0  NaN  1.0  NaN  NaN  NaN  ...   
821000000000000000  1.0  NaN  NaN  1.0  NaN  NaN  NaN  NaN  NaN  1.0  ...   
822000000000000000  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...   

cluster_label       278  290  327   413  432  438  485  521  565  634  
user_id                                                                
819000000000000000  1.0  1.0  NaN   NaN  NaN  NaN  NaN  NaN  NaN  NaN  
820000000000000000  NaN  NaN  NaN   NaN  1.0  NaN  NaN  NaN  NaN  NaN  
821000000000000000  NaN  NaN  1.0   NaN  NaN  1.0  1.0  1.0  1.0  NaN  
822000000000000000  NaN  NaN  NaN  15.0  NaN  NaN  NaN  NaN  NaN  2.0  

[4 rows x 23 columns]

Or use crosstab:

result = pd.crosstab(data['user_id'],data['cluster_label'])
print (result)
cluster_label       35   54   77   90   98   109  143  191  204  207  ...  \
user_id                                                               ...   
819000000000000000    0    1    0    0    0    1    0    2    1    0  ...   
820000000000000000    0    0    1    0    2    0    1    0    0    0  ...   
821000000000000000    1    0    0    1    0    0    0    0    0    1  ...   
822000000000000000    0    0    0    0    0    0    0    0    0    0  ...   

cluster_label       278  290  327  413  432  438  485  521  565  634  
user_id                                                               
819000000000000000    1    1    0    0    0    0    0    0    0    0  
820000000000000000    0    0    0    0    1    0    0    0    0    0  
821000000000000000    0    0    1    0    0    1    1    1    1    0  
822000000000000000    0    0    0   15    0    0    0    0    0    2  

[4 rows x 23 columns]
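
(Another aside, not from the original answer: if you prefer the dict-per-user shape mentioned in the question rather than a wide matrix, the zero cells can be dropped and each row collected into a dict. This sketch assumes result is the crosstab just above.)

per_user = {uid: row[row > 0].to_dict() for uid, row in result.iterrows()}
print (per_user[822000000000000000])   # {413: 15, 634: 2}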


Thanks, this is great! I was not aware of Series.unstack, but I see it performs a pivot, which is very useful. However, I ran this on a larger dataset and I am coming to realise it may not be the best solution for what I am ultimately trying to achieve, because I have 700+ clusters and the result is a very large, sparse matrix. I will accept this and keep exploring other, hopefully more efficient, solutions.
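
(A possible direction for that concern, added here only as a sketch and not something from the original thread, assuming data is the DataFrame shown above: with 700+ clusters it can be cheaper to keep the counts in long format, or to build the per-user dicts directly, instead of materialising the full wide matrix.)

# long format: one row per (user_id, cluster_label) pair that actually occurs,
# so nothing is stored for the zero cells
counts = data.groupby(['user_id', 'cluster_label']).size()

# or build the per-user dicts directly while iterating over the groups
per_user = {uid: grp['cluster_label'].value_counts().to_dict()
            for uid, grp in data.groupby('user_id')}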