Python pandas: multi-column ranking and labeling ties between observations
I have a data frame similar to:
group name sum count max_size
1 1 aaa 3 2 4
2 1 bbb 3 1 4
3 1 ccc 2 2 4
4 1 ddd 2 2 4
5 1 eee 1 0 4
6 2 aaa 3 2 3
7 2 bbb 3 1 3
8 2 ccc 2 3 3
9 2 ddd 2 1 3
10 3 aaa 3 4 4
11 3 bbb 3 2 4
12 3 ccc 2 5 4
13 3 ddd 2 1 4
14 3 eee 2 1 4
15 3 fff 2 1 4
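I would like to label each observation according to the following decision logic:
- within each group (groupby()), rank the names first by sum and then by count, both in descending order
- select the top n elements, where n is max_size, the maximum number of elements to pick in that group
In the case of group 2, the maximum number of elements to pick is 3 and there are 3 clear candidates: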
group name decision sum count max_size
1 2 aaa winner 3 2 3
2 2 bbb winner 3 1 3
3 2 ccc winner 2 3 3
4 2 ddd loser 2 1 3
aaa, bbb and ccc are the top three when sorted first by sum and then by count, while ddd does not make the cut.
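The group 2 case above can be sketched as a plain top-n selection. This is only a minimal illustration that ignores ties; the frame `g2` and the helper column `pos` are names made up for this example, not part of the original question.

```python
import numpy as np
import pandas as pd

# Group 2 from the question (illustrative copy).
g2 = pd.DataFrame({
    "group": [2, 2, 2, 2],
    "name": ["aaa", "bbb", "ccc", "ddd"],
    "sum": [3, 3, 2, 2],
    "count": [2, 1, 3, 1],
    "max_size": [3, 3, 3, 3],
})

# Rank within the group: sum first, then count, both descending.
g2 = g2.sort_values(["group", "sum", "count"], ascending=[True, False, False])

# 1-based position of each row inside its group after sorting.
g2["pos"] = g2.groupby("group").cumcount() + 1

# Without ties, the first max_size rows win and the rest lose.
g2["decision"] = np.where(g2["pos"] <= g2["max_size"], "winner", "loser")
print(g2[["group", "name", "decision", "sum", "count", "max_size"]])
```

This works for group 2 only because no two rows share the same sum and count; the tie case needs extra handling.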
For group 3, however:
group name decision sum count max_size
1 3 aaa winner 3 4 4
2 3 bbb winner 3 2 4
3 3 ccc winner 2 5 4
4 3 ddd unclear 2 1 4
5 3 eee unclear 2 1 4
6 3 fff unclear 2 1 4
aaa, bbb and ccc are the top three, but the fourth place (given max_size=4) is unclear, because ddd, eee and fff have the same count and sum.
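One way to detect this kind of tie is to look at the first and last sorted position occupied by each block of identical (sum, count) rows. The sketch below assumes a tie is "unclear" only when the block straddles the max_size cutoff, which is one possible reading of the rule; `g3`, `pos`, `first_pos` and `last_pos` are hypothetical names used for illustration.

```python
import pandas as pd

# Group 3 from the question (illustrative copy).
g3 = pd.DataFrame({
    "group": [3] * 6,
    "name": ["aaa", "bbb", "ccc", "ddd", "eee", "fff"],
    "sum": [3, 3, 2, 2, 2, 2],
    "count": [4, 2, 5, 1, 1, 1],
    "max_size": [4] * 6,
})

g3 = g3.sort_values(["group", "sum", "count"], ascending=[True, False, False])
g3["pos"] = g3.groupby("group").cumcount() + 1

# For every block of rows tied on (group, sum, count), record the first
# and last position that the block occupies in the sorted order.
tie_pos = g3.groupby(["group", "sum", "count"])["pos"]
g3["first_pos"] = tie_pos.transform("min")
g3["last_pos"] = tie_pos.transform("max")

# Assumption: a block entirely inside the cutoff wins, a block entirely
# outside loses, and a block that straddles the cutoff is unclear.
g3["decision"] = "unclear"
g3.loc[g3["last_pos"] <= g3["max_size"], "decision"] = "winner"
g3.loc[g3["first_pos"] > g3["max_size"], "decision"] = "loser"
print(g3[["group", "name", "decision", "sum", "count", "max_size"]])
```

Here ddd, eee and fff occupy positions 4 to 6, so the block crosses the cutoff at 4 and all three are labeled unclear.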
I would like to end up with a final data frame that labels the observations as follows:
name decision sum count max_size
1 aaa winner 3 2 4
2 bbb winner 3 1 4
3 ccc unclear 2 2 4
4 ddd unclear 2 2 4
5 eee winner 1 0 4
6 aaa winner 3 2 3
7 bbb winner 3 1 3
8 ccc winner 2 3 3
9 ddd loser 2 1 3
10 aaa winner 3 4 4
11 bbb winner 3 2 4
12 ccc winner 2 5 4
13 ddd unclear 2 1 4
14 eee unclear 2 1 4
15 fff unclear 2 1 4
Reproducible example:
{'group': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1, 5: 2, 6: 2, 7: 2, 8: 2, 9: 3, 10: 3, 11: 3, 12: 3, 13: 3, 14: 3}, 'name': {0: 'aaa', 1: 'bbb', 2: 'ccc', 3: 'ddd', 4: 'eee', 5: 'aaa', 6: 'bbb', 7: 'ccc', 8: 'ddd', 9: 'aaa', 10: 'bbb', 11: 'ccc', 12: 'ddd', 13: 'eee', 14: 'fff'}, 'decision': {0: 'winner', 1: 'winner', 2: 'unclear', 3: 'unclear', 4: 'winner', 5: 'winner', 6: 'winner', 7: 'winner', 8: 'loser', 9: 'winner', 10: 'winner', 11: 'winner', 12: 'unclear', 13: 'unclear', 14: 'unclear'}, 'sum': {0: 3, 1: 3, 2: 2, 3: 2, 4: 1, 5: 3, 6: 3, 7: 2, 8: 2, 9: 3, 10: 3, 11: 2, 12: 2, 13: 2, 14: 2}, 'count': {0: 2, 1: 1, 2: 2, 3: 2, 4: 0, 5: 2, 6: 1, 7: 3, 8: 1, 9: 4, 10: 2, 11: 5, 12: 1, 13: 1, 14: 1}, 'max_size': {0: 4, 1: 4, 2: 4, 3: 4, 4: 4, 5: 3, 6: 3, 7: 3, 8: 3, 9: 4, 10: 4, 11: 4, 12: 4, 13: 4, 14: 4}}
You can probably shorten the following code, but it should work:
import numpy as np

# sort within each group by sum, then count, both descending
df = df.sort_values(['group', 'sum', 'count'], ascending=[True, False, False])
# rows sharing (group, sum, count) are candidates for unclear
df['dup'] = df.duplicated(['group', 'sum', 'count'], keep=False)
# set decision column
df['decision'] = 'winner'
# if dup, those are unclear
df.loc[df['dup'], 'decision'] = 'unclear'
# allocate just a fraction of weight for unclear entries
# (the small epsilon keeps the fractions from summing to just under 1)
df['alloc'] = df.loc[df['dup']].groupby('group')['decision'].transform(lambda x: 1 / np.size(x) + 1e-6)
# if not dup, then allocate 1
df.loc[~df['dup'], 'alloc'] = 1
# cumulative allocation, truncated to int, is compared with max_size
df['cum_alloc'] = df.groupby('group')['alloc'].cumsum().astype(int)
# rows whose cumulative allocation exceeds max_size are losers
df.loc[df['cum_alloc'] > df['max_size'], 'decision'] = 'loser'
# finally trim columns
df = df[['name', 'decision', 'sum', 'count', 'max_size']]
Output:
>>> df
name decision sum count max_size
1 aaa winner 3 2 4
2 bbb winner 3 1 4
3 ccc unclear 2 2 4
4 ddd unclear 2 2 4
5 eee winner 1 0 4
6 aaa winner 3 2 3
7 bbb winner 3 1 3
8 ccc winner 2 3 3
9 ddd loser 2 1 3
10 aaa winner 3 4 4
11 bbb winner 3 2 4
12 ccc winner 2 5 4
13 ddd unclear 2 1 4
14 eee unclear 2 1 4
15 fff unclear 2 1 4
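To see how the fractional allocation in the code above behaves, here is the arithmetic for group 3 on its own. It is a small illustration, not part of the original answer; `alloc` and `cum_alloc` simply mirror the column names used there.

```python
import numpy as np

# Group 3 after sorting: three untied rows get weight 1 each,
# the three tied rows (ddd, eee, fff) share one slot: 1/3 + 1e-6 each.
alloc = np.array([1.0, 1.0, 1.0] + [1 / 3 + 1e-6] * 3)

# The running total is truncated to an integer, as in the answer.
cum_alloc = alloc.cumsum().astype(int)
print(cum_alloc)  # [1 2 3 3 3 4]

# max_size for group 3 is 4, so no row exceeds it and none becomes 'loser'.
print((cum_alloc > 4).any())  # False
```

Together the tied rows consume roughly one slot, so they stay unclear rather than pushing anything past max_size. Note that the answer computes the fraction per group rather than per tie block, so a group containing two separate tie blocks would divide by the total number of tied rows in that group.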
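Let's look at group 1: I have 5 elements and must select 4 (the max_size value). I first rank them by sum, then by priority (count). aaa and bbb are the top two. ccc and ddd have the same sum and count values, so I would pick only one of them but label both as "pick_random", and then eee is the last value selected after them: the final size of the group will be 4.
I think your output for the first group is incorrect according to your own logic. There are 4 winners: aaa, bbb, ccc and ddd are clear winners, and eee is a clear loser, no?
You only need to pick one from among the unclear winners @ALollz
You need to provide the input data frame as something that can be copied and pasted into Python (like the output).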