Python: group by one column, get counts from another column
Tags: python, pandas, pandas-groupby

I have a dataset (reproduced further below) and I want to group the data by user_id and get, for each user_id, the count of each cluster_label. The goal is to see how many times each user visited each cluster.

Basically, I'm looking for a result that returns this information (it could be a list, a dict, or comma-separated values):

I tried the following code:

data['user_id'] = data.index
result = data.groupby(['user_id','cluster_label']).count()

and

The second code block got me closer to what I'm looking for, but I couldn't figure out the count part:

790068 [[485, 256, 304, 311, 311, 311, 311, 417, 417]]
I think you can count with size and reshape with unstack, either replacing the missing values or not:
result = data.groupby(['user_id','cluster_label']).size().unstack(fill_value=0)
print(result)
cluster_label 35 54 77 90 98 109 143 191 204 207 ... \
user_id ...
819000000000000000 0 1 0 0 0 1 0 2 1 0 ...
820000000000000000 0 0 1 0 2 0 1 0 0 0 ...
821000000000000000 1 0 0 1 0 0 0 0 0 1 ...
822000000000000000 0 0 0 0 0 0 0 0 0 0 ...
cluster_label 278 290 327 413 432 438 485 521 565 634
user_id
819000000000000000 1 1 0 0 0 0 0 0 0 0
820000000000000000 0 0 0 0 1 0 0 0 0 0
821000000000000000 0 0 1 0 0 1 1 1 1 0
822000000000000000 0 0 0 15 0 0 0 0 0 2
[4 rows x 23 columns]
Or use one of the alternatives shown further below (unstack without fill_value, or pd.crosstab).
Thanks, this is great! I didn't know about Series.unstack, but I can see that it performs a pivot, which is very useful. However, I ran this on a larger dataset and came to realize it may not be the best solution for what I'm ultimately trying to achieve: I have 700+ clusters, so the result is a very large and sparse matrix. I'll accept this answer and keep exploring other, hopefully more efficient, solutions.
The data from the question:
user_id,timestamp,latitude,longitude,cluster_label
822000000000000000,3/28/2017 22:31,38.7842,-77.0164,634
822000000000000000,3/28/2017 22:44,38.7842,-77.0164,634
822000000000000000,3/29/2017 8:02,38.8976805,-77.0387238,413
822000000000000000,3/29/2017 8:21,38.8976805,-77.0387238,413
822000000000000000,3/29/2017 19:58,38.8976805,-77.0387238,413
822000000000000000,3/29/2017 22:12,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 9:07,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 10:27,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 17:17,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 17:19,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 17:19,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 17:20,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 17:22,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 18:16,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 18:17,38.8976805,-77.0387238,413
822000000000000000,3/30/2017 21:43,38.8976805,-77.0387238,413
822000000000000000,3/31/2017 7:04,38.8976805,-77.0387238,413
821000000000000000,3/9/2017 19:06,39.1328,-76.694,35
821000000000000000,3/9/2017 19:07,39.03426644,-76.6874899,90
821000000000000000,3/9/2017 19:07,38.93730032,-77.08885944,207
821000000000000000,3/9/2017 19:07,38.9071923,-77.0368707,327
821000000000000000,3/9/2017 19:06,38.8940974,-77.0276216,438
821000000000000000,3/9/2017 19:07,38.882584,-77.1124701,521
821000000000000000,3/9/2017 19:08,38.8577901,-76.8538565,565
821000000000000000,3/27/2017 21:12,38.888108,-77.01978416,485
820000000000000000,3/9/2017 19:09,39.0535541,-77.1347642,77
820000000000000000,3/9/2017 19:08,38.9847,-77.1131,143
820000000000000000,3/22/2017 14:26,38.8951,-77.0367,432
820000000000000000,3/24/2017 19:13,39.0227,-77.1864,98
820000000000000000,3/30/2017 7:39,39.0227,-77.1864,98
819000000000000000,3/9/2017 19:09,39.0942239,-76.85709,54
819000000000000000,3/9/2017 19:11,39.0042,-77.019,109
819000000000000000,3/9/2017 19:16,38.95315,-77.447735,191
819000000000000000,3/9/2017 19:10,38.95278983,-77.44791904,191
819000000000000000,3/9/2017 19:12,38.94033497,-77.17591993,204
819000000000000000,3/9/2017 19:09,38.917866,-77.023722,260
819000000000000000,3/9/2017 19:09,38.917866,-77.023722,260
819000000000000000,3/9/2017 19:09,38.917866,-77.023722,260
819000000000000000,3/9/2017 19:15,38.91778,-76.9769,263
819000000000000000,3/9/2017 19:12,38.916489,-77.0318051,264
819000000000000000,3/9/2017 19:12,38.915147,-77.0217751,278
819000000000000000,3/9/2017 19:15,38.912068,-77.0190228,290
Without fill_value, unstack leaves NaN where a user never visited a cluster:

result = data.groupby(['user_id','cluster_label']).size().unstack()
print(result)
cluster_label 35 54 77 90 98 109 143 191 204 207 ... \
user_id ...
819000000000000000 NaN 1.0 NaN NaN NaN 1.0 NaN 2.0 1.0 NaN ...
820000000000000000 NaN NaN 1.0 NaN 2.0 NaN 1.0 NaN NaN NaN ...
821000000000000000 1.0 NaN NaN 1.0 NaN NaN NaN NaN NaN 1.0 ...
822000000000000000 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ...
cluster_label 278 290 327 413 432 438 485 521 565 634
user_id
819000000000000000 1.0 1.0 NaN NaN NaN NaN NaN NaN NaN NaN
820000000000000000 NaN NaN NaN NaN 1.0 NaN NaN NaN NaN NaN
821000000000000000 NaN NaN 1.0 NaN NaN 1.0 1.0 1.0 1.0 NaN
822000000000000000 NaN NaN NaN 15.0 NaN NaN NaN NaN NaN 2.0
[4 rows x 23 columns]
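The NaN variant distinguishes "never visited" from a genuine zero, and the zero-filled integer table can still be recovered from it afterwards; a small sketch on made-up data:

```python
import pandas as pd

# Invented toy data
data = pd.DataFrame({
    'user_id': [1, 1, 2],
    'cluster_label': ['a', 'b', 'b'],
})

# Without fill_value, unstack() leaves NaN, which forces a float dtype
raw = data.groupby(['user_id', 'cluster_label']).size().unstack()

# Filling and casting afterwards recovers the integer count table
filled = raw.fillna(0).astype(int)
print(filled)
```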
Or use pd.crosstab, which builds the same zero-filled count table directly:

result = pd.crosstab(data['user_id'], data['cluster_label'])
print(result)
cluster_label 35 54 77 90 98 109 143 191 204 207 ... \
user_id ...
819000000000000000 0 1 0 0 0 1 0 2 1 0 ...
820000000000000000 0 0 1 0 2 0 1 0 0 0 ...
821000000000000000 1 0 0 1 0 0 0 0 0 1 ...
822000000000000000 0 0 0 0 0 0 0 0 0 0 ...
cluster_label 278 290 327 413 432 438 485 521 565 634
user_id
819000000000000000 1 1 0 0 0 0 0 0 0 0
820000000000000000 0 0 0 0 1 0 0 0 0 0
821000000000000000 0 0 1 0 0 1 1 1 1 0
822000000000000000 0 0 0 15 0 0 0 0 0 2
[4 rows x 23 columns]
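The crosstab result should coincide with the size().unstack(fill_value=0) table; a quick equivalence check on made-up data:

```python
import pandas as pd

# Invented toy data
data = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 2],
    'cluster_label': ['x', 'y', 'y', 'y', 'x'],
})

via_groupby = data.groupby(['user_id', 'cluster_label']).size().unstack(fill_value=0)
via_crosstab = pd.crosstab(data['user_id'], data['cluster_label'])

# Both yield the same user_id x cluster_label table of counts
assert via_groupby.equals(via_crosstab)
```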