Python: how to assign values selected from a list to a df column according to a probability distribution?

Tags: python, pandas, dataframe, probability-distribution

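I have a dataset with many rows, where each row is an individual from some country with an input category (1 or 2). Each unique row appears 5 times (the same row 5 times, then the next row 5 times, and so on). What I want to do is create a new column in the df (say, Output) and assign it another value (also 1 or 2) according to a conditional distribution.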

    Country  Input category  Output category
0   Algeria               1                0
1   Algeria               1                0
2   Algeria               1                0
3   Algeria               1                0
4   Algeria               1                0
5   Algeria               2                0
6   Algeria               2                0
7   Algeria               2                0
8   Algeria               2                0
9   Algeria               2                0
10   France               1                0
11   France               1                0
12   France               1                0
13   France               1                0
14   France               1                0
15   France               2                0
16   France               2                0
17   France               2                0
18   France               2                0
19   France               2                0
20    Italy               1                0
21    Italy               1                0
22    Italy               1                0
23    Italy               1                0
24    Italy               1                0
25    Italy               2                0
26    Italy               2                0
27    Italy               2                0
28    Italy               2                0
29    Italy               2                0
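For example, since for Algeria p1_1 (the probability of Output = 1 given Input = 1) is 2/5, I want to assign output 1 to 2 of my 5 rows (and therefore output 2 to the remaining 3 rows).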
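
The arithmetic in this example can be sketched as follows. This is only an illustration of how a probability turns into row counts; the variable names are assumptions, not part of the original question.

```python
import numpy as np

n = 5          # rows per (Country, Input category) group
p1_1 = 2 / 5   # P(Output = 1 | Input = 1) for Algeria, from the example

count_1 = int(round(p1_1 * n))   # rows that receive output 1
count_2 = n - count_1            # remaining rows receive output 2

# Repeat each output value as many times as its count.
outputs = np.repeat([1, 2], [count_1, count_2])
print(outputs.tolist())  # [1, 1, 2, 2, 2]
```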

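Edited: the expected output is based on the following conditional probabilities (cond_prob in the code below), where pX_Y is the probability of Output = X given Input = Y: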

   Country  p1_1  p2_1  p1_2  p2_2
0  Algeria   0.4   0.6   0.4   0.6
1   France   0.2   0.8   0.6   0.4
2    Italy   0.2   0.8   1.0   0.0
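
To experiment with the approach, the two tables above can be rebuilt roughly as follows. This is a hypothetical reconstruction: the question only shows the printed tables, so the way they are constructed here is an assumption.

```python
import pandas as pd

# Hypothetical reconstruction of the example data shown above.
countries = ["Algeria", "France", "Italy"]
n = 5  # each (Country, Input category) pair appears 5 times

rows = [(country, category)
        for country in countries
        for category in (1, 2)
        for _ in range(n)]
df = pd.DataFrame(rows, columns=["Country", "Input category"])
df["Output category"] = 0  # placeholder, to be filled in later

# pX_Y: probability of Output = X given Input = Y
cond_prob = pd.DataFrame({
    "Country": countries,
    "p1_1": [0.4, 0.2, 0.2],
    "p2_1": [0.6, 0.8, 0.8],
    "p1_2": [0.4, 0.6, 1.0],
    "p2_2": [0.6, 0.4, 0.0],
})
```

For each input category the two probabilities sum to 1, which is what lets the counts add up to exactly 5 rows per group.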
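IIUC (if I understand correctly), each probability label can be repeated in proportion to its value and the result assigned to the new column; the code is given further below, followed by a variant for the case where sorting by Input category is needed.

print(df)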
    Country  Input category  Output category
0   Algeria               1                1
1   Algeria               1                1
2   Algeria               1                2
3   Algeria               1                2
4   Algeria               1                2
5   Algeria               2                1
6   Algeria               2                1
7   Algeria               2                2
8   Algeria               2                2
9   Algeria               2                2
10   France               1                1
11   France               1                2
12   France               1                2
13   France               1                2
14   France               1                2
15   France               2                1
16   France               2                1
17   France               2                1
18   France               2                2
19   France               2                2
20    Italy               1                1
21    Italy               1                2
22    Italy               1                2
23    Italy               1                2
24    Italy               1                2
25    Italy               2                1
26    Italy               2                1
27    Italy               2                1
28    Italy               2                1
29    Italy               2                1

Can you show the expected output?
@ansev Just done.
Can you check my solution?
Thanks, very useful!

    Country  Input category  Output category
0   Algeria               1                1
1   Algeria               1                1
2   Algeria               1                1
3   Algeria               1                2
4   Algeria               1                2
5   Algeria               2                1
6   Algeria               2                1
7   Algeria               2                1
8   Algeria               2                2
9   Algeria               2                2
10   France               1                1
11   France               1                2
12   France               1                2
13   France               1                2
14   France               1                2
15   France               2                1
16   France               2                1
17   France               2                1
18   France               2                2
19   France               2                2
20    Italy               1                1
21    Italy               1                2
22    Italy               1                2
23    Italy               1                2
24    Italy               1                2
25    Italy               2                1
26    Italy               2                1
27    Italy               2                1
28    Italy               2                1
29    Italy               2                1

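n = 5  # number of rows per (Country, Input category) pair
# Alternatively, derive n from the data:
#s = df['Country'].value_counts()
#assert s.nunique() == 1
#n = s.iloc[0] // df['Input category'].nunique()
#print(n)
##5
df = df.sort_values(['Country', 'Input category']).reset_index(drop=True)
# Long format: one row per (Country, pX_Y). A stable sort keeps the column
# order p1_1, p2_1, p1_2, p2_2 within each country.
df2 = cond_prob.melt('Country').sort_values('Country', kind='mergesort')
# Number of rows for each label; repeat counts must be integers.
counts = df2['value'].mul(n).round().astype(int)
# Repeat each label, then keep the first digit group of pX_Y (the output X).
df['Output category'] = (df2.reindex(df2.index.repeat(counts))['variable']
                            .str.extract(r'(\d+)')[0].to_numpy().astype(int))
print(df)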
    Country  Input category  Output category
0   Algeria               1                1
1   Algeria               1                1
2   Algeria               1                2
3   Algeria               1                2
4   Algeria               1                2
5   Algeria               2                1
6   Algeria               2                1
7   Algeria               2                2
8   Algeria               2                2
9   Algeria               2                2
10   France               1                1
11   France               1                2
12   France               1                2
13   France               1                2
14   France               1                2
15   France               2                1
16   France               2                1
17   France               2                1
18   France               2                2
19   France               2                2
20    Italy               1                1
21    Italy               1                2
22    Italy               1                2
23    Italy               1                2
24    Italy               1                2
25    Italy               2                1
26    Italy               2                1
27    Italy               2                1
28    Italy               2                1
29    Italy               2                1
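If you need to sort by Input category:

# Long format, stable-sorted by country, each row repeated value * 5 times.
df2 = cond_prob.melt('Country').sort_values('Country', kind='mergesort')
df2 = df2.reindex(df2.index.repeat(df2['value'].mul(5).round().astype(int)))
# Split 'pX_Y' into its output part ('pX') and input part ('Y').
parts = (df2['variable'].str.split('_', expand=True)
                        .set_axis(['Output category', 'Input category'], axis=1))
# Order by country and input, then keep the output digit as an integer.
values = (df2.assign(**parts)
             .sort_values(['Country', 'Input category'])['Output category']
             .str.extract(r'(\d+)')[0].astype(int).to_numpy())
df['Output category'] = values
print(df)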
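
As a sanity check, the assigned proportions can be compared against the conditional probabilities. The sketch below is not part of the original answer: it rebuilds the example tables (a hypothetical reconstruction), parses the output and input numbers out of the pX_Y labels explicitly instead of relying on column order, and then verifies the observed share of each output within every (Country, Input category) group.

```python
import pandas as pd

# Hypothetical reconstruction of the example inputs.
countries = ["Algeria", "France", "Italy"]
cond_prob = pd.DataFrame({
    "Country": countries,
    "p1_1": [0.4, 0.2, 0.2], "p2_1": [0.6, 0.8, 0.8],
    "p1_2": [0.4, 0.6, 1.0], "p2_2": [0.6, 0.4, 0.0],
})
n = 5
df = pd.DataFrame(
    [(c, i) for c in countries for i in (1, 2) for _ in range(n)],
    columns=["Country", "Input category"],
)

# Long format with integer Output/Input columns parsed from 'pX_Y'.
long = cond_prob.melt("Country", var_name="label", value_name="p")
parts = long["label"].str.extract(r"p(\d+)_(\d+)").astype(int)
long["Output category"] = parts[0]
long["Input category"] = parts[1]
long["count"] = long["p"].mul(n).round().astype(int)
long = long.sort_values(["Country", "Input category", "Output category"])

# Repeat each output value 'count' times; the row order then matches df,
# which is already ordered by Country and Input category.
df["Output category"] = (
    long["Output category"].repeat(long["count"].to_numpy()).to_numpy()
)

# Observed share of each output within every (Country, Input category) group.
observed = (
    df.groupby(["Country", "Input category"])["Output category"]
      .value_counts(normalize=True)
      .rename("share")
      .reset_index()
)
check = observed.merge(long, on=["Country", "Input category", "Output category"])

# Every non-zero probability should appear, with a matching share.
assert len(check) == (long["p"] > 0).sum()
assert (check["share"] - check["p"]).abs().max() < 1e-9
```

The assignment is deterministic: within each group the 1s always come before the 2s. If a random order within each group is wanted, the rows would need to be shuffled per group afterwards.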