Python: How do I one-hot encode a cell that holds multiple values?


I have the following table in Excel:

id  class
0   2 3
1   1 3 
2   3 5

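Now I want to do a "special" one-hot encoding in Python. Each id in the first table has two numbers, and each number corresponds to a class (class 1, class 2, and so on). The second table is built from the first: for each id, every number in its row appears in the matching class column, and all other columns get zero. For example, id 0 has the numbers 2 and 3, so 2 goes into class2 and 3 goes into class3, while class1, class4 and class5 keep the default value 0. The result should be: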

id  class1  class2  class3  class4  class5
 0   0       2        3       0       0
 1   1       0        3       0       0
 2   0       0        3       0       5

My attempt so far:

foo = lambda x: pd.Series([i for i in x.split()])
result=onehot['hotel'].apply(foo)
result.columns=['class1','class2']
pd.get_dummies(result, prefix='class', columns=['class1','class2'])

Result:

    class_1 class_2 class_3 class_3 class_5
  0  0.0     1.0    0.0      1.0    0.0
  1  1.0     0.0    0.0      1.0    0.0
  2  0.0     0.0    1.0      0.0    1.0

(class_3 appears twice.) What can I do to fix this? (Once this step works, I can convert the result into the final format I want.)

Does this solve the problem as you stated it?

#!/usr/bin/env python3

# (id, classes) pairs from the question.
rows = [
    (0, (2, 3)),
    (1, (1, 3)),
    (2, (3, 5)),
]

# The largest class number in the data determines how many columns to print.
maximum = max(c for _, classes in rows for c in classes)

# Print the header.
print("\t".join(["id"] + ["class_%d" % i for i in range(1, maximum + 1)]))

for i, classes in rows:
    # Each cell holds the class number if this id has it, otherwise 0.0.
    cells = [float(r) if r in classes else 0.0 for r in range(1, maximum + 1)]
    print("\t".join([str(i)] + [str(c) for c in cells]))

Output:

id      class_1 class_2 class_3 class_4 class_5
0       0.0     2.0     3.0     0.0     0.0
1       1.0     0.0     3.0     0.0     0.0
2       0.0     0.0     3.0     0.0     5.0

It may be simpler to split the original data frame into 3 columns:

id  class_a class_b
0   2          3
1   1          3  
2   3          5

Then run a normal one-hot encoding on it. Afterwards you may end up with duplicate columns, such as:

id  ... class_a_3 class_b_3 ... class_b_5
0          0          1             0
1          0          1             0
2          1          0             1

But you can simply merge/sum them afterwards.
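The split, encode and sum steps described above could look roughly like the sketch below. It assumes the two values are already in separate columns named class_a and class_b (as in the table above) and that the full set of classes is 1 to 5; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data from the question, already split into two columns.
df = pd.DataFrame(
    {"class_a": [2, 1, 3], "class_b": [3, 3, 5]},
    index=pd.Index([0, 1, 2], name="id"),
)

# A shared prefix makes both source columns emit names like "class_3",
# which is what produces the duplicate columns.
dummies = pd.get_dummies(df, columns=["class_a", "class_b"], prefix="class", dtype=int)

# Sum columns that share a name (transpose, group by label, transpose back).
merged = dummies.T.groupby(level=0).sum().T

# Add the classes that never occur (here class_4) as all-zero columns.
n_classes = 5  # assumption: the full set of classes is 1..5
columns = [f"class_{k}" for k in range(1, n_classes + 1)]
merged = merged.reindex(columns=columns, fill_value=0)

# Scale each indicator by its class number to get the layout asked for.
result = merged * np.arange(1, n_classes + 1)
print(result)
```

If the same class could appear in both columns of one row, the summed indicator would be 2, so it would need clipping to 1 before the scaling step.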

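Similarly, you can apply the same logic after converting the df to long form, with one row per id/value pair. Then one-hot encode it and aggregate with a sum keyed on id.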
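A minimal sketch of the long-form variant, assuming the raw cells are whitespace-separated strings as in the question. The name long_df is made up for illustration, and DataFrame.explode requires pandas 0.25 or later.

```python
import pandas as pd

# Sample data from the question: several class numbers per cell.
df = pd.DataFrame({"id": [0, 1, 2], "class": ["2 3", "1 3", "3 5"]})

# Long form: split each cell into a list, then emit one row per (id, class) pair.
df["class"] = df["class"].str.split()
long_df = df.explode("class")
long_df["class"] = long_df["class"].astype(int)

# One-hot encode the single class column, then sum the indicators per id.
onehot = pd.get_dummies(long_df, columns=["class"], prefix="class", dtype=int)
result = onehot.groupby("id").sum()
print(result)
```

Because there is only one class column to encode, no duplicate column names arise. Classes that never occur (class_4 here) are still absent and would have to be added separately, for example with reindex.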

You need to make the variable categorical; you can then use pd.get_dummies as shown below:

In [18]: df1 = pd.DataFrame({"class": pd.Categorical(['2','1','3'], categories=['1','2','3','4','5'])})

In [19]: df2 = pd.DataFrame({"class": pd.Categorical(['3','3','5'], categories=['1','2','3','4','5'])})

In [20]: df_1 = pd.get_dummies(df1, dtype=int)

In [21]: df_2 = pd.get_dummies(df2, dtype=int)

In [22]: df_1.add(df_2).apply(lambda x: x * list(range(1, len(df_1.columns) + 1)), axis=1).astype(int).rename_axis('id')
Out[22]: 
    class_1  class_2  class_3  class_4  class_5
id                                             
0         0        2        3        0        0
1         1        0        3        0        0
2         0        0        3        0        5

@凤梨 Just a heads-up: storing lambda functions in a variable, as you did, defeats their purpose. You can just use def.
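For illustration, the lambda assigned to foo in the question could be written as an ordinary function like this. The name split_classes is hypothetical and not from the original post.

```python
import pandas as pd


def split_classes(cell):
    """Split a whitespace-separated cell such as "2 3" into a Series of strings."""
    return pd.Series(cell.split())


print(split_classes("2 3").tolist())
```

It can be passed to apply just as foo was, and a def gives the function a real name in tracebacks.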