Python: fastest way to save one-hot-encoded features into a DataFrame

python pandas machine-learning scikit-learn

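I have a pandas DataFrame containing all of my features and labels. One of my features is categorical and needs to be one-hot encoded.

The feature is an integer and can only take values from 0 to 4.

To save these arrays back into my DataFrame, I use the following code: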

# enc is my OneHotEncoder object
df['mycol'] = df['mycol'].map(lambda x: enc.transform(x).toarray())

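My DataFrame has more than 1 million rows, so the code above takes a while. Is there a faster way to assign the arrays to the DataFrame cells? Since I only have 5 categories, I should not need to call the transform() function 1 million times.

I have already tried something like this: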

num_categories = 5
i = 0
while (i<num_categories):
    df.loc[df['mycol'] == i, 'mycol'] = enc.transform(i).toarray()
    i += 1

You can use pd.get_dummies().

Or use OneHotEncoder:

>>> import numpy as np
>>> from sklearn.preprocessing import OneHotEncoder

>>> enc = OneHotEncoder()
>>> a = np.array([1, 1, 3, 2, 2]).reshape(-1, 1)
>>> a
array([[1],
       [1],
       [3],
       [2],
       [2]])

>>> one_hot = enc.fit_transform(a)
>>> one_hot.toarray()
array([[ 1.,  0.,  0.],
       [ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  1.,  0.]])
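To connect this to the question, here is a minimal sketch of encoding the whole column in one call and writing the result back as separate columns, rather than calling transform() once per row. The small DataFrame and the column names mycol_0 to mycol_4 are made up for illustration, and passing categories explicitly is an assumption based on the stated range of 0 to 4.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the real DataFrame; 'mycol' holds integers 0..4.
df = pd.DataFrame({"mycol": [0, 3, 1, 4, 2, 3], "label": [1, 0, 0, 1, 1, 0]})

# Listing the categories up front gives one column per value 0..4,
# even if some value does not occur in the data.
enc = OneHotEncoder(categories=[[0, 1, 2, 3, 4]])

# A single call encodes every row; the input must be 2-D, hence df[["mycol"]].
one_hot = enc.fit_transform(df[["mycol"]]).toarray()

# Illustrative column names, one per category.
new_cols = [f"mycol_{c}" for c in enc.categories_[0]]
encoded = pd.DataFrame(one_hot, columns=new_cols, index=df.index)

# Replace the original column with the five indicator columns.
result = df.drop(columns="mycol").join(encoded)
print(result)
```

Storing one indicator per column, instead of one array per cell, keeps the DataFrame numeric and avoids object-dtype cells.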

Try:

df['mycol'] = pd.factorize(df['mycol'])[0]
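Note that factorize only produces integer codes, not the one-hot vectors themselves. One possible way to finish the job, sketched here under the assumption that a dense NumPy array is acceptable, is to index an identity matrix with those codes. The sample data is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real column.
df = pd.DataFrame({"mycol": [0, 3, 1, 4, 2, 3]})

# factorize returns integer codes (numbered in order of first appearance)
# together with the unique values those codes refer to.
codes, uniques = pd.factorize(df["mycol"])

# Row i of the identity matrix is the one-hot vector for code i,
# so fancy indexing encodes all rows in one vectorized step.
one_hot = np.eye(len(uniques), dtype=int)[codes]

# Column j corresponds to uniques[j], which is not necessarily sorted.
encoded = pd.DataFrame(one_hot, columns=uniques, index=df.index)

# Because the values here are already 0..4, they can index the matrix directly,
# which keeps the columns in numeric order.
direct = np.eye(5, dtype=int)[df["mycol"].to_numpy()]
print(direct)
```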

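Compared to my solution, get_dummies() is very fast, and it also helps with the next step in my code (saving each element into a new column). Thanks for your help!

Also consider using OneHotEncoder, since you can use it in a scikit-learn pipeline and therefore do not need an extra merge step.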
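A rough sketch of the pipeline idea from the last comment: the encoder sits inside a ColumnTransformer, so the encoded columns never have to be merged back into the DataFrame by hand. The column "other", the sample data, and the choice of LogisticRegression are assumptions made only for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: one categorical feature, one numeric feature, one label.
df = pd.DataFrame({
    "mycol": [0, 3, 1, 4, 2, 3, 0, 1],
    "other": [0.5, 1.2, 0.3, 2.2, 1.1, 1.9, 0.4, 0.2],
    "label": [0, 1, 0, 1, 1, 1, 0, 0],
})

# One-hot encode 'mycol' and pass every other feature column through unchanged.
preprocess = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(handle_unknown="ignore"), ["mycol"])],
    remainder="passthrough",
)

# The estimator is a placeholder; any scikit-learn estimator could follow.
model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

X = df[["mycol", "other"]]
y = df["label"]
model.fit(X, y)
preds = model.predict(X)
print(preds)
```

The same fitted pipeline can then be applied to new data, so the encoding learned at training time is reused automatically.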

>>> s
0    a
1    b
2    c
3    a
dtype: object

>>> pd.get_dummies(s)
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
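Applied to the question's setup, the same call could look roughly like the sketch below. The sample DataFrame is made up, and converting the column to a categorical dtype first is an optional assumption that guarantees all five columns exist even when a value is missing from the data.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; 'mycol' holds integers 0..4.
df = pd.DataFrame({"mycol": [0, 3, 1, 4, 2, 3], "label": [1, 0, 0, 1, 1, 0]})

# Declare the full set of categories so get_dummies always emits five columns.
as_cat = df["mycol"].astype(pd.CategoricalDtype(categories=[0, 1, 2, 3, 4]))

# One vectorized call; dtype=int yields 0/1 instead of booleans.
dummies = pd.get_dummies(as_cat, prefix="mycol", dtype=int)

# Swap the original column for the indicator columns.
result = pd.concat([df.drop(columns="mycol"), dummies], axis=1)
print(result)
```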
