Recoding categorical variables in Python

python scikit-learn

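I have been trying to learn Python 3.6 using the Anaconda distribution. I have run into problems with the material from the online course I am taking and could use some help with a few error messages. I would of course ask the instructors, but they do not seem very responsive to students' questions.

I am having trouble recoding categorical data with the three main classes. As far as I understand, the scikit-learn package provides three classes for recoding variables: LabelEncoder, OneHotEncoder, and LabelBinarizer. I have tried each of them to recode a categorical variable in my dataset, but I get an error every time.

Please forgive my inexperience with the sample code. As you may have guessed, my problem is a basic one and I am not proficient in Python.

The object X contains several columns, the first of which holds the categorical strings I need to convert. (If someone could also tell me how to insert a table, that would be helpful. Do I have to use HTML?)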

"Fish"  1  5  3
"Dog"   2  6  9
"Dog"   8  8  8
"Cat"   5  7  6
"Cat"   6  6  6

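LabelEncoder attempt

Below is the code I tried to run, along with the error message I received, for an object X whose structure is roughly as described above.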

from sklearn.preprocessing import LabelEncoder
labelencoder_X =LabelEncoder 
X[:, 0] = LabelEncoder.fit_transform(X[:, 0])

TypeError: fit_transform() missing 1 required positional argument: 'y'

What confuses me is that I thought the code above clearly defines what y is: the first column of X.

OneHotEncoder

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform[X].toarray()

TypeError: 'method' object is not subscriptable

LabelBinarizer

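I find this one the hardest to understand, and I could not actually attempt it given the structure of my dataset.

Any guidance or advice you can offer would be immensely helpful.

Let's go through this step by step.

First, load the data you showed into a numpy array named X: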

import numpy as np
X = np.array([["Fish", 1, 5, 3],
              ["Dog",  2, 6, 9],
              ["Dog",  8, 8, 8],
              ["Cat",  5, 7, 6],
              ["Cat",  6, 6, 6]])

Now let's try your code.

1) LabelEncoder

from sklearn.preprocessing import LabelEncoder
labelencoder_X =LabelEncoder 
X[:, 0] = LabelEncoder.fit_transform(X[:, 0])

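The mistake here is that you are using the class LabelEncoder itself as if it were an object and calling fit_transform on it. Because the method is called on the class rather than on an instance, X[:, 0] is bound to self, and Python reports that y is missing. Correct it as follows: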

from sklearn.preprocessing import LabelEncoder
labelencoder_X =LabelEncoder()
X[:, 0] = labelencoder_X.fit_transform(X[:, 0])

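See the changes on lines 2 and 3 above. First, I created an object labelencoder_X of the LabelEncoder class by calling LabelEncoder(), and then I called fit_transform() on that object with labelencoder_X.fit_transform(). The code now runs without errors, and the new X is: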

Output:
array([['2', '1', '5', '3'],
       ['1', '2', '6', '9'],
       ['1', '8', '8', '8'],
       ['0', '5', '7', '6'],
       ['0', '6', '6', '6']], dtype='|S4')

Note that the first column has been converted successfully.
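To make the class-versus-instance distinction concrete, here is a minimal sketch that is not part of the original answer. It assumes a standalone one-dimensional array named animals (a hypothetical name) instead of the first column of X, reproduces the TypeError by calling the method on the class, and then shows the instance call together with the classes_ attribute that records which string maps to which integer.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical standalone column, mirroring the first column of X
animals = np.array(["Fish", "Dog", "Dog", "Cat", "Cat"])

# Calling the method on the class itself: the array is taken as `self`,
# so Python complains that a required argument is missing.
error_message = None
try:
    LabelEncoder.fit_transform(animals)
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Calling the method on an instance works as intended.
encoder = LabelEncoder()
codes = encoder.fit_transform(animals)
print(codes)             # [2 1 1 0 0]
print(encoder.classes_)  # ['Cat' 'Dog' 'Fish']
```

The labels are assigned in sorted order of the class names, which is why "Cat" becomes 0 and "Fish" becomes 2.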

2) OneHotEncoder

Your code:

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform[X].toarray()

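Here you are not making the mistake you made with LabelEncoder: calling OneHotEncoder(...) initializes the object correctly. The mistake is in fit_transform[X]. fit_transform is a method, so it must be called with parentheses, fit_transform(X). Square brackets try to index the method object instead, which raises the TypeError.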

For more details about the error, see

The correct code should be:

from sklearn.preprocessing import OneHotEncoder 
onehotencoder = OneHotEncoder(categorical_features=[0]) 
X = onehotencoder.fit_transform(X).toarray()

Output: 
array([[0., 0., 1., 1., 5., 3.],
       [0., 1., 0., 2., 6., 9.],
       [0., 1., 0., 8., 8., 8.],
       [1., 0., 0., 5., 7., 6.],
       [1., 0., 0., 6., 6., 6.]])

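Note: the code above should be called on the X that has already been transformed with LabelEncoder. If you use it on the original X, it will still throw an error.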
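For readers on a newer scikit-learn release: as far as I know, the categorical_features argument was deprecated in version 0.20 and removed in 0.22, so the snippet above only runs on older installations. The following is a hedged sketch of one common replacement, not the original answer's method. It assumes scikit-learn 0.20 or later and uses ColumnTransformer to one-hot encode column 0 directly from the strings, so the LabelEncoder step is not needed. The transformer name "animal" is an arbitrary label chosen for illustration.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Same data as above, stored as objects so the numbers stay numeric
X = np.array([["Fish", 1, 5, 3],
              ["Dog",  2, 6, 9],
              ["Dog",  8, 8, 8],
              ["Cat",  5, 7, 6],
              ["Cat",  6, 6, 6]], dtype=object)

# One-hot encode column 0 and pass the remaining columns through unchanged.
# sparse_threshold=0 asks ColumnTransformer to always return a dense array.
column_transformer = ColumnTransformer(
    transformers=[("animal", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,
)
X_encoded = column_transformer.fit_transform(X).astype(float)
print(X_encoded)
```

The first three columns of the result are the indicators for Cat, Dog, and Fish, in that order, followed by the three original numeric columns.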

3) LabelBinarizer

This is no different from LabelEncoder, except that it also one-hot encodes the column you give it.

from sklearn.preprocessing import LabelBinarizer
labelencoder_X =LabelBinarizer()
new_binarized_val = labelencoder_X.fit_transform(X[:, 0])

Output:
array([[0, 0, 1],
       [0, 1, 0],
       [0, 1, 0],
       [1, 0, 0],
       [1, 0, 0]])

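Note: I ran the LabelBinarizer code on the original X from your question, not on the already encoded one. The output shows only the binarized form of the first column.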
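Since LabelBinarizer returns only the encoded column, you may want to join it back to the rest of the data. The sketch below is one possible way to do that, under the assumption that the remaining columns of X can be converted to floats. The variable names are chosen for illustration, and inverse_transform is used only to show that the original labels can be recovered.

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Original X from the question; numpy stores every entry as a string here
X = np.array([["Fish", 1, 5, 3],
              ["Dog",  2, 6, 9],
              ["Dog",  8, 8, 8],
              ["Cat",  5, 7, 6],
              ["Cat",  6, 6, 6]])

binarizer = LabelBinarizer()
animal_columns = binarizer.fit_transform(X[:, 0])  # shape (5, 3)

# Convert the remaining string columns to numbers and stack them
numeric_columns = X[:, 1:].astype(float)
X_combined = np.hstack([animal_columns, numeric_columns])
print(X_combined)

# The fitted binarizer can map the indicator rows back to the labels
recovered = binarizer.inverse_transform(animal_columns)
print(recovered)  # ['Fish' 'Dog' 'Dog' 'Cat' 'Cat']
```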

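Hope this clears things up.

Ah, it all makes sense now. I didn't realize that LabelEncoder was linked to OneHotEncoder. I thought they were separate processes trying to do the same thing. I had found a website suggesting that you need square brackets to reference X in OneHotEncoder, but that did not work. I now understand that I was still getting errors because of the mistake in my LabelEncoder code. Thank you very much!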