Python: get duplicate row counts while keeping the original index
I need to find the duplicate rows in a DataFrame and then add an extra column with their count. Suppose we have this DataFrame:
>>> print(df)
+----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 |
| 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+
The frame above should then become the frame below, with an additional column holding the count. Note that the original index values are kept:
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 1 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 1 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 1 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
I have seen other solutions, such as:
df.groupby(list(df.columns.values)).size()
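But this returns a result that has gaps and no longer carries the original index.

You can first convert the index to a column with reset_index, then aggregate that column with first and len (the code below uses size for the count). If you need to group by all columns, exclude the index column with difference. If necessary, rename the count column to the next column label, 10, with rename: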
# convert to str if necessary
last_col = str(df.columns.astype(int).max() + 1)
print(last_col)
10
print(df.reset_index()
        .groupby(df.columns.difference(['index']).tolist())['index']
        .agg(['first', 'size'])
        .reset_index()
        .set_index(['first'])
        .sort_index()
        .rename_axis(None)
        .rename(columns={'size': last_col}))
    2  3  4  5  6  7  8  9  10
0   0  0  0  0  0  0  0  0   2
1   2  0  0  0  0  0  0  0   2
2   2  4  3  4  1  1  4  4   1
3   4  3  4  0  0  0  0  0   2
4   2  3  4  3  4  0  0  0   1
5   5  0  0  0  0  0  0  0   3
6   4  5  0  0  0  0  0  0   1
7   1  1  4  0  0  0  0  0   1
10  3  3  4  3  5  5  5  0   1
11  5  4  0  0  0  0  0  0   1
13  0  4  0  0  0  0  0  0   1
15  1  3  5  0  0  0  0  0   1
16  4  0  0  0  0  0  0  0   1
17  3  3  4  4  0  0  0  0   1
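
For comparison, here is a minimal alternative sketch. It is not the approach shown above: it uses groupby(...).transform("size") to attach the group size to every row, then drop_duplicates to keep the first occurrence of each row together with its original index. The rows are copied from the question's table. The string column labels "2" to "9" and the names rows, value_cols, counts and result are assumptions made for illustration.

```python
import pandas as pd

# Rows copied from the question's input table.
# Assumption: the column labels are the strings "2".."9".
rows = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 0],
    [2, 4, 3, 4, 1, 1, 4, 4],
    [4, 3, 4, 0, 0, 0, 0, 0],
    [2, 3, 4, 3, 4, 0, 0, 0],
    [5, 0, 0, 0, 0, 0, 0, 0],
    [4, 5, 0, 0, 0, 0, 0, 0],
    [1, 1, 4, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [4, 3, 4, 0, 0, 0, 0, 0],
    [3, 3, 4, 3, 5, 5, 5, 0],
    [5, 4, 0, 0, 0, 0, 0, 0],
    [5, 0, 0, 0, 0, 0, 0, 0],
    [0, 4, 0, 0, 0, 0, 0, 0],
    [2, 0, 0, 0, 0, 0, 0, 0],
    [1, 3, 5, 0, 0, 0, 0, 0],
    [4, 0, 0, 0, 0, 0, 0, 0],
    [3, 3, 4, 4, 0, 0, 0, 0],
    [5, 0, 0, 0, 0, 0, 0, 0],
]
df = pd.DataFrame(rows, columns=[str(c) for c in range(2, 10)])

value_cols = df.columns.tolist()
# Label for the new count column: the next integer after the largest label.
last_col = str(df.columns.astype(int).max() + 1)

# transform("size") broadcasts each group's row count back onto every row,
# so the resulting Series is aligned with df's original index.
counts = df.groupby(value_cols)[value_cols[0]].transform("size")

# Keep only the first occurrence of each distinct row; its index is preserved.
result = (df.assign(**{last_col: counts})
            .drop_duplicates(subset=value_cols, keep="first"))
print(result)
```

Because transform returns a Series aligned with the original index, this variant needs no reset_index / set_index round trip. Which of the two reads better is a matter of taste.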
Glad I could help!