scikit-learn: Issue with OneHotEncoder for categorical features

I want to encode 3 categorical features out of the 10 features in my dataset. I use preprocessing from sklearn to do this as follows:

from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)

However, I cannot proceed because I get the following error:

    array = np.array(array, dtype=dtype, order=order, copy=copy)
ValueError: could not convert string to float: PG

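I am surprised that it complains about the string, since converting it is exactly what it is supposed to do! Am I missing something here?

From the documentation: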

categorical_features : “all” or array of indices or mask
Specify what features are treated as categorical.
‘all’ (default): All features are treated as categorical.
array of indices: Array of categorical feature indices.
mask: Array of length n_features and with dtype=bool.

Column names of a pandas DataFrame won't work. If your categorical features are column numbers 0, 2 and 6, use:

from sklearn import preprocessing
cat_features = [0, 2, 6]
enc = preprocessing.OneHotEncoder(categorical_features=cat_features)
enc.fit(dataset.values)
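If you only know the column names, one way to get the positional indices is to look them up on the DataFrame. This is a minimal sketch: the `dataset` below is a made-up stand-in for the one in the question, and it only covers the name-to-index step (string values in those columns still need encoding on older scikit-learn versions).

```python
import pandas as pd

# Hypothetical stand-in for the question's dataset (values are made up).
dataset = pd.DataFrame({
    "color": ["Color", "Color", "Black and White"],
    "duration": [120, 95, 101],
    "director_name": ["A", "B", "A"],
    "budget": [1.0, 2.5, 0.7],
    "actor_2_name": ["X", "Y", "X"],
})
cat_features = ["color", "director_name", "actor_2_name"]

# Index.get_indexer returns the integer position of each label (-1 if missing).
cat_indices = dataset.columns.get_indexer(cat_features).tolist()
print(cat_indices)  # [0, 2, 4]
```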

It must also be noted that if these categorical features are not label encoded, you need to use LabelEncoder on them before using OneHotEncoder.

If you read the docs for OneHotEncoder, you'll see that the input for fit is "Input array of type int". So you need two steps to one-hot encode your data:
from sklearn import preprocessing
cat_features = ['color', 'director_name', 'actor_2_name']
enc = preprocessing.LabelEncoder()
enc.fit(cat_features)
new_cat_features = enc.transform(cat_features)
print(new_cat_features)  # [1 2 0]
new_cat_features = new_cat_features.reshape(-1, 1)  # Needs to be the correct shape
# sparse=False is easier to read; the argument is named sparse_output from scikit-learn 1.2 on
ohe = preprocessing.OneHotEncoder(sparse=False)
print(ohe.fit_transform(new_cat_features))

Output:
[[ 0.  1.  0.]
 [ 0.  0.  1.]
 [ 1.  0.  0.]]

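EDIT
As of 0.20 this became a bit easier, not only because OneHotEncoder now handles strings nicely, but also because we can transform multiple columns easily using ColumnTransformer. See the example below: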
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
import numpy as np

X = np.array([['apple', 'red', 1, 'round', 0],
              ['orange', 'orange', 2, 'round', 0.1],
              ['bannana', 'yellow', 2, 'long', 0],
              ['apple', 'green', 1, 'round', 0.2]])
ct = ColumnTransformer(
    [('oh_enc', OneHotEncoder(sparse=False), [0, 1, 3]),],  # the column numbers I want to apply this to
    remainder='passthrough'  # This leaves the rest of my columns in place
)
print(ct.fit_transform(X))  # Notice the output is a string

Output:
[['1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '0.0' '0.0' '1.0' '1' '0']
 ['0.0' '0.0' '1.0' '0.0' '1.0' '0.0' '0.0' '0.0' '1.0' '2' '0.1']
 ['0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1.0' '0.0' '2' '0']
 ['1.0' '0.0' '0.0' '1.0' '0.0' '0.0' '0.0' '0.0' '1.0' '1' '0.2']]
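For the setup in the question (a pandas DataFrame with named categorical columns), the same idea might look like the sketch below. The DataFrame contents are made up for illustration, and it assumes scikit-learn 0.20 or later, where ColumnTransformer accepts column names when given a DataFrame. Passing sparse_threshold=0 asks ColumnTransformer for a dense result, which sidesteps the sparse/sparse_output naming difference between versions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the question's dataset (values are made up).
dataset = pd.DataFrame({
    "color": ["Color", "Color", "Black and White"],
    "director_name": ["A", "B", "A"],
    "actor_2_name": ["X", "Y", "X"],
    "duration": [120, 95, 101],
})
cat_features = ["color", "director_name", "actor_2_name"]

ct = ColumnTransformer(
    [("oh_enc", OneHotEncoder(handle_unknown="ignore"), cat_features)],
    remainder="passthrough",  # keep the numeric column as it is
    sparse_threshold=0,       # always return a dense array
)
encoded = ct.fit_transform(dataset)
print(encoded.shape)  # 2 + 2 + 2 one-hot columns plus 1 passthrough column
```

Because the remaining column is numeric here, the result is a numeric array rather than an array of strings.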

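You can apply both transformations (from text categories to integer categories, then from integer categories to one-hot vectors) in one shot using the LabelBinarizer class: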
from sklearn.preprocessing import LabelBinarizer
cat_features = ['color', 'director_name', 'actor_2_name']
encoder = LabelBinarizer()
new_cat_features = encoder.fit_transform(cat_features)
new_cat_features

Note that this returns a dense NumPy array by default. You can get a sparse matrix instead by passing sparse_output=True to the LabelBinarizer constructor.
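As a small illustration of the sparse option, here is a sketch that reuses the same three strings as above. Keep in mind that LabelBinarizer works on a single 1-D array of labels, so it would be applied to one column at a time.

```python
from scipy import sparse
from sklearn.preprocessing import LabelBinarizer

cat_features = ['color', 'director_name', 'actor_2_name']

# sparse_output=True returns a SciPy sparse matrix instead of a dense array.
encoder = LabelBinarizer(sparse_output=True)
sparse_features = encoder.fit_transform(cat_features)

print(sparse.issparse(sparse_features))  # True
print(sparse_features.toarray())         # dense view, for inspection only
```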
If the dataset is in a pandas DataFrame, using pandas.get_dummies is more straightforward.
*corrected from pandas.get_getdummies to pandas.get_dummies
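A minimal sketch of the get_dummies route, again with a made-up DataFrame standing in for the question's dataset:

```python
import pandas as pd

# Hypothetical stand-in for the question's dataset (values are made up).
dataset = pd.DataFrame({
    "color": ["Color", "Color", "Black and White"],
    "director_name": ["A", "B", "A"],
    "actor_2_name": ["X", "Y", "X"],
    "duration": [120, 95, 101],
})
cat_features = ["color", "director_name", "actor_2_name"]

# Only the listed columns are expanded; the others are left untouched.
encoded_df = pd.get_dummies(dataset, columns=cat_features)
print(encoded_df.columns.tolist())
```

Each new column is named after the original column and the category value, for example color_Color.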
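@Medo, I encountered the same behavior and found it frustrating. As others have pointed out, Scikit-Learn requires all data to be numerical before it even considers selecting the columns provided in the categorical_features parameter.
Specifically, the column selection is handled by the _transform_selected() method in /sklearn/preprocessing/data.py, and the very first line of that method is
X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)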

This check fails if any of the data in the provided dataframe X cannot be successfully converted to a float.
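The failure can be reproduced without scikit-learn at all. The sketch below only mimics the float conversion with plain NumPy on made-up values; it is not the library's actual code path.

```python
import numpy as np

# Mixed string/numeric data, similar in spirit to dataset.values.
values = np.array([["PG", 1.0], ["R", 2.0]], dtype=object)

try:
    values.astype(float)  # the whole array must be convertible to float
    conversion_failed = False
except ValueError as exc:
    conversion_failed = True
    print("conversion failed:", exc)
```

The conversion is attempted on the whole array, before any column selection could take place, which is why a single string column is enough to trigger the error.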
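I agree that the documentation of sklearn.preprocessing.OneHotEncoder is misleading in that regard.
If you are as frustrated by this as I was, there is a simple fix: use the OneHotEncoder from category_encoders. It is a Sklearn Contrib package, so it plays very nicely with the scikit-learn API.
It works as a direct replacement and does the boring label encoding for you.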
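from category_encoders import OneHotEncoder
cat_features = ['color', 'director_name', 'actor_2_name']
# category_encoders selects columns by name through the cols argument
enc = OneHotEncoder(cols=cat_features)
enc.fit(dataset)  # pass the DataFrame itself so the column names are available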

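Comment on @piman314's answer (not enough reputation to comment):

This problem only occurs with sklearn version …

I don't understand this answer at all. Where do you fit your encoders with data from the dataset? Could you provide a more detailed example using the dataset from the question?

How would you do this in a pipeline?

Honestly, the variable naming is confusing. cat_features is not the list of categorical features in a dataset; it is the dataset itself, with one categorical column. LabelEncoder encodes one categorical variable at a time.

Regarding the edit: using a pandas DataFrame allows mixed-type output.
X = pd.DataFrame([['apple', 'red', 1, 'round', 0], …
with ct = ColumnTransformer([('oh_enc', OneHotEncoder(sparse=False), [0, 1]), …
produces mixed output: [[1.0 0.0 0.0 0.0 0.0 1.0 0.0 1 'round' 0.0] …

Yes, this is way easier! With get_dummies, I still struggle to get a consistent OHE between the test dataset and the training dataset without merging them first.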
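One common workaround for the train/test mismatch with get_dummies is to reindex the test columns against the training columns. The sketch below uses made-up data and is only one possible approach; a OneHotEncoder fitted on the training set is another.

```python
import pandas as pd

# Made-up data: "blue" appears only in test, "red" only in train.
train = pd.DataFrame({"color": ["red", "green", "red"]})
test = pd.DataFrame({"color": ["green", "blue"]})

train_ohe = pd.get_dummies(train, columns=["color"])

# Align test to the training columns: categories unseen in training are
# dropped, and categories missing from test are added as all-zero columns.
test_ohe = pd.get_dummies(test, columns=["color"]).reindex(
    columns=train_ohe.columns, fill_value=0
)

print(test_ohe.columns.tolist())  # ['color_green', 'color_red']
```

Note that this silently discards categories that only appear in the test set, which may or may not be what you want.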