Python: How to perform OneHotEncoding in Sklearn, getting a ValueError

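I have just started learning machine learning. While practicing one of the tasks I get a ValueError, even though I followed the same steps as the instructor. Please help.

dff

First I did the label encoding: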

X=dff.values
label_encoder=LabelEncoder()
X[:,0]=label_encoder.fit_transform(X[:,0])

out:
X
array([[0, 'Sri'],
       [2, 'Vignesh'],
       [1, 'Pechi'],
       [2, 'Raj']], dtype=object)

Then I applied one-hot encoding to the same X:

onehotencoder=OneHotEncoder( categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()

I get the following error:

ValueError                                Traceback (most recent call last)
<ipython-input-472-be8c3472db63> in <module>()
----> 1 X=onehotencoder.fit_transform(X).toarray()

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in fit_transform(self, X, y)
   1900         """
   1901         return _transform_selected(X, self._fit_transform,
-> 1902                                    self.categorical_features, copy=True)
   1903 
   1904     def _transform(self, X):

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\data.py in _transform_selected(X, transform, selected, copy)
   1695     X : array or sparse matrix, shape=(n_samples, n_features_new)
   1696     """
-> 1697     X = check_array(X, accept_sparse='csc', copy=copy, dtype=FLOAT_DTYPES)
   1698 
   1699     if isinstance(selected, six.string_types) and selected == "all":

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    380                                       force_all_finite)
    381     else:
--> 382         array = np.array(array, dtype=dtype, order=order, copy=copy)
    383 
    384         if ensure_2d:

ValueError: could not convert string to float: 'Raj'

Please tell me what the mistake is. Thanks in advance.

The implementation below should work. Note that the input to OneHotEncoder's fit_transform must not be a rank-1 array, and that the output is sparse, so toarray() is used to expand it.

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]


df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

le = LabelEncoder()
X_num = le.fit_transform(X[:,0]).reshape(-1,1)

ohe = OneHotEncoder()
X_num = ohe.fit_transform(X_num)

print (X_num.toarray())

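# A (4, 3) one-hot block does not fit into the single column X[:, 0],
# so stack it next to the remaining columns instead.
X = np.hstack([X_num.toarray(), X[:, 1:]])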

print (X)
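
As an alternative to the manual steps above, here is a minimal sketch using ColumnTransformer, assuming scikit-learn 0.20 or later (where sklearn.compose is available). It one-hot encodes only the Country column and passes Name through unchanged. The variable names and the sparse_threshold=0 setting (which forces a dense result) are choices made for this illustration and are not part of the original answer.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
df = pd.DataFrame(data, columns=["Country", "Name"])
X = df.values

# Encode column 0 only; every other column is appended unchanged.
# sparse_threshold=0 asks for a dense array, since the Name strings
# cannot be stored in a numeric sparse matrix.
ct = ColumnTransformer(
    [("country", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = ct.fit_transform(X)

print(X_encoded)
```

The encoded columns come first, in the sorted category order (AUS, IND, USA), followed by the untouched Name column.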

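If you do want to encode multiple categorical features, another approach is to use a pipeline with a FeatureUnion and two custom transformers.

First you need two transformers: one to select a single column, and another to make LabelEncoder usable in a pipeline (its fit_transform method accepts only a single argument, whereas a pipeline step must also accept an optional y).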
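
The definitions of the two custom transformers are not shown here, so the following is only a sketch of what they might look like. The class names SingleColumnSelector and PipelineAwareLabelEncoder match the ones used in the pipeline code, but the bodies are assumptions written for illustration, not the original author's implementation.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder


class SingleColumnSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: pick one column of a 2-D array by position."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; the column index is fixed at construction time.
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.column]


class PipelineAwareLabelEncoder(TransformerMixin, BaseEstimator):
    """Hypothetical wrapper: LabelEncoder with the fit(X, y=None) signature."""

    def fit(self, X, y=None):
        self.encoder_ = LabelEncoder().fit(X)
        return self

    def transform(self, X):
        # Return a column vector so the next step (OneHotEncoder) gets 2-D input.
        return self.encoder_.transform(X).reshape(-1, 1)
```

TransformerMixin supplies fit_transform(X, y=None) on top of fit and transform, which is what lets both classes sit inside make_pipeline.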

Next, create a pipeline (or just a FeatureUnion) with two branches, one per categorical column. Each branch selects one column, label-encodes it, and then one-hot encodes it.

import pandas as pd
import numpy as np

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline, make_pipeline, FeatureUnion

pipeline = Pipeline([(
    'encoded_features',
    FeatureUnion([('countries',
        make_pipeline(
            SingleColumnSelector(0),
            PipelineAwareLabelEncoder(),
            OneHotEncoder()
        )), 
        ('names', make_pipeline(
            SingleColumnSelector(1),
            PipelineAwareLabelEncoder(),
            OneHotEncoder()
        ))
    ]))
])

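Finally, run the full data frame through the pipeline. It one-hot encodes each column separately and concatenates the results at the end.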

df = pd.DataFrame([["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]], columns=['Country', 'Name'])
X = df.values
transformed_X = pipeline.fit_transform(X)
print(transformed_X.toarray())

This returns (the first 3 columns are the countries, the last 4 are the names):

[[ 1.  0.  0.  0.  0.  1.  0.]
 [ 0.  0.  1.  0.  0.  0.  1.]
 [ 0.  1.  0.  1.  0.  0.  0.]
 [ 0.  0.  1.  0.  1.  0.  0.]]

You can now go straight to OneHotEncoder without using LabelEncoder first, and with version 0.22 approaching, many people may want to do it this way to avoid warnings and potential errors.

Example code 1, where all columns are encoded and the categories are specified explicitly:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

countries = np.unique(X[:,0])
names = np.unique(X[:,1])

ohe = OneHotEncoder(categories=[countries, names])
X = ohe.fit_transform(X).toarray()

print (X)

Output of code example 1:

[[1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1. 0. 0.]]

Example code 2, showing the 'auto' option for the categories specification. The first 3 columns encode the country names and the last 4 encode the personal names:

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

ohe = OneHotEncoder(categories='auto')
X = ohe.fit_transform(X).toarray()

print (X)

Output of code example 2 (same as example 1):

[[1. 0. 0. 0. 0. 1. 0.]
 [0. 0. 1. 0. 0. 0. 1.]
 [0. 1. 0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 1. 0. 0.]]

Example code 3, where only the first column is one-hot encoded. Now, here is the distinctive part: what if you only need to one-hot encode a specific column of your data?

(Note: for illustration I have left the last column as strings. In practice this makes more sense when the last column is already numeric.)

import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder

data= [["AUS", "Sri"],["USA","Vignesh"],["IND", "Pechi"],["USA","Raj"]]

df = pd.DataFrame(data, columns=['Country', 'Name'])
X = df.values

countries = np.unique(X[:,0])
# Keep the names in their original row order so that they stay aligned
# with the encoded rows (np.unique would sort them and mix up the rows).
names = X[:,1]

ohe = OneHotEncoder(categories=[countries]) # specify ONLY unique country names
tmp = ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()

X = np.append(tmp, names.reshape(-1,1), axis=1)

print (X)

Output of code example 3:

[[1.0 0.0 0.0 'Sri']
 [0.0 0.0 1.0 'Vignesh']
 [0.0 1.0 0.0 'Pechi']
 [0.0 0.0 1.0 'Raj']]

Long story short, if you want dummy variables for your df, use pd.get_dummies, as follows:

dummy=pd.get_dummies(df['str'])
df=pd.concat([df,dummy], axis=1)
print(df)

Why don't you convert the 'Name' column to numeric, as you did for 'Country'? OneHotEncoder only handles numeric X. So either drop that column from X or convert it to numeric before sending it to OneHotEncoder.

I passed only one column:

X[:,0]=OneHotEncoder.fit_transform(X[:,0]).toarray()

but I still get:

\sklearn\utils\validation.py:395: DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and will raise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample. DeprecationWarning)

Yes, that is because you are passing a rank-1 array, i.e. X[:,0], to onehotencoder.fit_transform, which is deprecated. So you need to reshape it with X[:,0].reshape(-1,1), or use np.newaxis to reshape it.

@aruneshingh Thank you. Could you post your answer using my data? I tried the reshape and got:

DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples, ), for example using ravel(). y = column_or_1d(y, warn=True)

My output is

array([[1.0, 2], [2.0, 3], [1.0, 0], [2.0, 1]], dtype=object)

It should be 1 or 0, right?

Those are only warnings, so they will not interfere with the result.

It works. How can I display X in data frame format?

I have been working on the same thing since this morning (the past several hours), trying to figure out an approach like example 3 above. How did you come up with these two lines?

tmp = ohe.fit_transform(X[:,0].reshape(-1, 1)).toarray()
X = np.append(tmp, names.reshape(-1,1), axis=1)

I have read the scikit-learn documentation several times and it is still not very clear to me. Thank you very much for your answer.

@michael, I only just saw your question here, sorry. You need a combination of numpy and sklearn, but mostly numpy. Let me know if it still does not help.
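
One of the comments asks how to display the encoded X as a data frame. A minimal sketch, assuming the same Country/Name data used in the answers: the fitted encoder's categories_ attribute is used to build readable column names. The "Country_" prefix and the variable names are illustrative choices, not something taken from the thread.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

data = [["AUS", "Sri"], ["USA", "Vignesh"], ["IND", "Pechi"], ["USA", "Raj"]]
df = pd.DataFrame(data, columns=["Country", "Name"])

# Encode only the Country column (passed as a 2-D array of shape (4, 1)).
ohe = OneHotEncoder(categories="auto")
encoded = ohe.fit_transform(df[["Country"]].values).toarray()

# categories_[0] holds the sorted unique values of the single fitted column.
columns = ["Country_" + str(c) for c in ohe.categories_[0]]

encoded_df = pd.DataFrame(encoded, columns=columns, index=df.index)
result = pd.concat([encoded_df, df[["Name"]]], axis=1)

print(result)
```

Reusing df.index for the encoded frame keeps the rows aligned when the two frames are concatenated.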