Python: how do I assign values chosen from a list to a df column according to a probability distribution?
I have a dataset with many rows, each representing an individual from some country, with an input category (1 or 2). Each unique row appears 5 times (the same row 5 times, then the next row 5 times, and so on). I want to create a new column in df (say, Output) and assign it another value (also 1 or 2) according to a conditional distribution.
Country Input category Output category
0 Algeria 1 0
1 Algeria 1 0
2 Algeria 1 0
3 Algeria 1 0
4 Algeria 1 0
5 Algeria 2 0
6 Algeria 2 0
7 Algeria 2 0
8 Algeria 2 0
9 Algeria 2 0
10 France 1 0
11 France 1 0
12 France 1 0
13 France 1 0
14 France 1 0
15 France 2 0
16 France 2 0
17 France 2 0
18 France 2 0
19 France 2 0
20 Italy 1 0
21 Italy 1 0
22 Italy 1 0
23 Italy 1 0
24 Italy 1 0
25 Italy 2 0
26 Italy 2 0
27 Italy 2 0
28 Italy 2 0
29 Italy 2 0
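For example, since for Algeria p1_1 (the probability of Output=1 given Input=1) is 2/5, I want to assign output 1 to 2 of those 5 rows (and therefore output 2 to the remaining 3 rows).
Edited: here is the expected output: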
Country p1_1 p2_1 p1_2 p2_2
0 Algeria 0.4 0.6 0.4 0.6
1 France 0.2 0.8 0.6 0.4
2 Italy 0.2 0.8 1.0 0.0
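To make the setup reproducible, here is a minimal sketch that rebuilds the two frames from the tables above. The name `cond_prob` is taken from the answer code further down; the reading of column `pX_Y` as P(Output = X | Input = Y) follows the Algeria example, and `row_counts` is a hypothetical helper name introduced only for illustration.

```python
import pandas as pd

countries = ["Algeria", "France", "Italy"]
n = 5  # each (Country, Input category) pair is repeated 5 times

# Input frame: 3 countries x 2 input categories x 5 repeats = 30 rows.
df = pd.DataFrame(
    [(c, i, 0) for c in countries for i in (1, 2) for _ in range(n)],
    columns=["Country", "Input category", "Output category"],
)

# Conditional probabilities; column pX_Y is read as P(Output = X | Input = Y).
cond_prob = pd.DataFrame({
    "Country": countries,
    "p1_1": [0.4, 0.2, 0.2],
    "p2_1": [0.6, 0.8, 0.8],
    "p1_2": [0.4, 0.6, 1.0],
    "p2_2": [0.6, 0.4, 0.0],
})

# Number of rows that should receive each output value: probability * n.
# Rounding guards against floating-point noise before casting to int.
row_counts = cond_prob.set_index("Country").mul(n).round().astype(int)
print(row_counts)
```

For each country the counts for a given input category add up to `n`, which is what lets the probabilities be turned into exact row assignments.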
IIUC, if you need to sort by Input category:
print(df)
Country Input category Output category
0 Algeria 1 1
1 Algeria 1 1
2 Algeria 1 2
3 Algeria 1 2
4 Algeria 1 2
5 Algeria 2 1
6 Algeria 2 1
7 Algeria 2 2
8 Algeria 2 2
9 Algeria 2 2
10 France 1 1
11 France 1 2
12 France 1 2
13 France 1 2
14 France 1 2
15 France 2 1
16 France 2 1
17 France 2 1
18 France 2 2
19 France 2 2
20 Italy 1 1
21 Italy 1 2
22 Italy 1 2
23 Italy 1 2
24 Italy 1 2
25 Italy 2 1
26 Italy 2 1
27 Italy 2 1
28 Italy 2 1
29 Italy 2 1
Can you show the expected output?
@ansev just done.
Can you check my solution?
Thanks, very useful!
Country Input category Output category
0 Algeria 1 1
1 Algeria 1 1
2 Algeria 1 1
3 Algeria 1 2
4 Algeria 1 2
5 Algeria 2 1
6 Algeria 2 1
7 Algeria 2 1
8 Algeria 2 2
9 Algeria 2 2
10 France 1 1
11 France 1 2
12 France 1 2
13 France 1 2
14 France 1 2
15 France 2 1
16 France 2 1
17 France 2 1
18 France 2 2
19 France 2 2
20 Italy 1 1
21 Italy 1 2
22 Italy 1 2
23 Italy 1 2
24 Italy 1 2
25 Italy 2 1
26 Italy 2 1
27 Italy 2 1
28 Italy 2 1
29 Italy 2 1
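n = 5  # number of times each (Country, Input category) row is repeated

# Alternatively, derive n from the data:
# s = df['Country'].value_counts()
# assert s.nunique() == 1
# n = s.iloc[0] // df['Input category'].nunique()
# print(n)  # 5

df = df.sort_values(['Country', 'Input category']).reset_index(drop=True)
# A stable sort keeps the melted column order (p1_1, p2_1, p1_2, p2_2) within each country.
df2 = cond_prob.melt('Country').sort_values('Country', kind='mergesort')
# Repeat each probability row round(p * n) times; repeat counts must be integers.
counts = df2['value'].mul(n).round().astype(int)
df['Output category'] = (df2.reindex(df2.index.repeat(counts))['variable']
                         .str.extract(r'(\d+)')[0].to_numpy().astype(int))
print(df)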
Country Input category Output category
0 Algeria 1 1
1 Algeria 1 1
2 Algeria 1 2
3 Algeria 1 2
4 Algeria 1 2
5 Algeria 2 1
6 Algeria 2 1
7 Algeria 2 2
8 Algeria 2 2
9 Algeria 2 2
10 France 1 1
11 France 1 2
12 France 1 2
13 France 1 2
14 France 1 2
15 France 2 1
16 France 2 1
17 France 2 1
18 France 2 2
19 France 2 2
20 Italy 1 1
21 Italy 1 2
22 Italy 1 2
23 Italy 1 2
24 Italy 1 2
25 Italy 2 1
26 Italy 2 1
27 Italy 2 1
28 Italy 2 1
29 Italy 2 1
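# Variant that sorts explicitly by Input category instead of relying on column order.
df2 = cond_prob.melt('Country').sort_values('Country', kind='mergesort')
# Repeat counts must be integers, so round p * 5 before repeating.
df2 = df2.reindex(df2.index.repeat(df2['value'].mul(5).round().astype(int)))
# Split e.g. 'p1_2' into Output category 'p1' and Input category '2'.
parts = (df2['variable'].str.split('_', expand=True)
         .set_axis(['Output category', 'Input category'], axis=1))
values = (df2.assign(**parts)
          .sort_values(['Country', 'Input category'])['Output category']
          .str.extract(r'(\d+)')[0].to_numpy().astype(int))
df['Output category'] = values
print(df)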
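As an alternative to the melt/repeat approach, the same assignment can be sketched with an explicit loop over the (Country, Input category) groups. This is not the answerer's method, only an illustration under stated assumptions: `assign_outputs` is a hypothetical helper, the frames are rebuilt here so the block runs on its own, and column `pX_Y` is read as P(Output = X | Input = Y). Passing a NumPy random generator shuffles which rows get which value while keeping the counts fixed.

```python
import numpy as np
import pandas as pd

countries = ["Algeria", "France", "Italy"]
n = 5
df = pd.DataFrame(
    [(c, i) for c in countries for i in (1, 2) for _ in range(n)],
    columns=["Country", "Input category"],
)
cond_prob = pd.DataFrame({
    "Country": countries,
    "p1_1": [0.4, 0.2, 0.2], "p2_1": [0.6, 0.8, 0.8],
    "p1_2": [0.4, 0.6, 1.0], "p2_2": [0.6, 0.4, 0.0],
})


def assign_outputs(df, cond_prob, rng=None):
    """Hypothetical helper: give each group round(p1 * size) ones, the rest twos."""
    probs = cond_prob.set_index("Country")
    out = pd.Series(0, index=df.index, dtype=int)
    groups = df.groupby(["Country", "Input category"]).groups
    for (country, inp), idx in groups.items():
        size = len(idx)
        n_ones = int(round(probs.loc[country, f"p1_{inp}"] * size))
        values = np.array([1] * n_ones + [2] * (size - n_ones))
        if rng is not None:
            rng.shuffle(values)  # random placement, same counts per group
        out.loc[idx] = values
    return out


# Deterministic assignment: ones first, then twos, within each group.
df["Output category"] = assign_outputs(df, cond_prob)

# Randomised placement with the same per-group counts.
shuffled = assign_outputs(df, cond_prob, rng=np.random.default_rng(0))
print(df.head(10))
```

The loop is slower than the vectorised version on large frames, but it does not depend on the column order of `cond_prob` or on the rows of `df` being pre-sorted.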