Python 3.x: Create dummies from a column of a data subset that doesn't contain all the category values in that column

python-3.x pandas

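I am working with a subset of a large data set.

The data frame has a column named "type". The values of "type" are expected to be in [1,2,3,4].

In a certain subset, I find that the "type" column contains only some of those values, such as [1,4], like this: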

 In [1]: df
 Out[2]:
          type
    0      1
    1      4

When I create dummies from the "type" column on this subset, I get:

In [3]: import pandas as pd
In [4]: pd.get_dummies(df["type"], prefix="type")
Out[5]:        type_1 type_4
        0        1       0
        1        0       1

It has no columns named "type_2" or "type_3". What I want is:

 Out[6]:        type_1 type_2 type_3 type_4
            0      1      0       0      0
            1      0      0       0      1

Is there a solution?

What you need to do is turn the column 'type' into a pd.Categorical and specify the categories:

pd.get_dummies(pd.Categorical(df.type, [1, 2, 3, 4]), prefix='type')

   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
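
If several subsets need the same columns, it may help to define the category list once and reuse it. The following is only a sketch under assumptions: the helper name `type_dummies`, the index labels, and the second subset are made up for illustration. It casts the Series itself to a categorical dtype (rather than building a bare `pd.Categorical`) so that the subset's original row index is carried over, and it passes `dtype=int` because recent pandas versions return boolean dummies by default.

```python
import pandas as pd

# Assumed full set of values "type" can take across the whole data set
type_dtype = pd.CategoricalDtype(categories=[1, 2, 3, 4])

def type_dummies(subset):
    """One-hot encode subset["type"] with one column per known category."""
    typed = subset["type"].astype(type_dtype)  # a Series, so the index is kept
    # dtype=int gives 0/1 output; recent pandas versions default to booleans
    return pd.get_dummies(typed, prefix="type", dtype=int)

# Two hypothetical subsets that each see only part of the categories
subset_a = pd.DataFrame({"type": [1, 4]}, index=[10, 57])
subset_b = pd.DataFrame({"type": [2, 3, 3]}, index=[3, 8, 21])

dummies_a = type_dummies(subset_a)
dummies_b = type_dummies(subset_b)

print(dummies_a)
print(list(dummies_a.columns) == list(dummies_b.columns))
```

Both results have the columns type_1 through type_4, so they can be concatenated or fed to the same model without realigning.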

Another solution converts the column to a categorical dtype:

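# Recent pandas versions no longer accept astype('category', categories=...); pass a CategoricalDtype instead
df1 = pd.get_dummies(df["type"].astype(pd.CategoricalDtype(categories=[1, 2, 3, 4])), prefix='type')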
print (df1)
   type_1  type_2  type_3  type_4
0       1       0       0       0
1       0       0       0       1
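
A further pandas-only option, not necessarily one the answers above had in mind, is to encode whatever values are present and then reindex the columns against the full list, filling the missing ones with zeros. This is a minimal sketch; `all_types` is an assumed name for the complete list of values.

```python
import pandas as pd

df = pd.DataFrame({"type": [1, 4]})
all_types = [1, 2, 3, 4]  # assumed complete list of possible values

dummies = (
    pd.get_dummies(df["type"], dtype=int)      # columns are the observed values: 1 and 4
    .reindex(columns=all_types, fill_value=0)  # add the missing values as all-zero columns
    .add_prefix("type_")                       # rename 1 -> type_1, 2 -> type_2, ...
)

print(dummies)
```

Reindexing also drops any column whose value is not in `all_types`, so an unexpected value would vanish silently; the categorical approaches turn such a value into a missing entry instead.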

Since you tagged your post as one-hot-encoding, you may find the sklearn module useful in addition to the pure pandas solutions:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# sample data
df = pd.DataFrame({'type':[1,4]})
n_vals = 5

# one-hot encoding
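# n_values is no longer accepted by recent scikit-learn releases; list the categories explicitly.
# Older releases spell sparse_output as sparse.
encoder = OneHotEncoder(categories=[list(range(n_vals))], sparse_output=False, dtype=int)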
data = encoder.fit_transform(df.type.values.reshape(-1,1))

# encoded data frame
newdf = pd.DataFrame(data, columns=['type_{}'.format(x) for x in range(n_vals)])

print(newdf)

   type_0  type_1  type_2  type_3  type_4
0       0       1       0       0       0
1       0       0       0       0       1

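One advantage of this approach is that OneHotEncoder can easily produce sparse vectors for very large sets of classes. (Just change to sparse=True in the OneHotEncoder() declaration; newer scikit-learn releases name this argument sparse_output.)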
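
To make the sparse variant concrete, here is a small sketch. It assumes a scikit-learn release that supports the `categories` argument, and it relies on the encoder returning a SciPy sparse matrix by default, so no sparse flag is passed at all. `all_types` is an illustrative name for the full list of values.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"type": [1, 4]})
all_types = [1, 2, 3, 4]  # assumed complete list of possible values

# No sparse flag is given: the encoder's default output is a SciPy sparse matrix
encoder = OneHotEncoder(categories=[all_types], dtype=int)
sparse_data = encoder.fit_transform(df[["type"]])

print(type(sparse_data))   # a scipy.sparse matrix type
print(sparse_data.shape)   # one column per entry in all_types

# Convert to a dense array only when the data is small enough to fit in memory
dense = sparse_data.toarray()
print(dense)
```

Only the non-zero entries are stored, which is what keeps memory use manageable when there are many categories.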

Glad I could help. Have a nice day!
