How does scikit-learn LabelEncoder() encode values?

I would like to know how LabelEncoder() works. This is part of my code:

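from sklearn.preprocessing import LabelEncoder

for att in all_features_test:
    if str(test_home_data[att].dtypes) == 'object':
        # Categorical column: fill missing values with a placeholder category
        test_home_data[att] = test_home_data[att].fillna('Nothing')
        train_home_data[att] = train_home_data[att].fillna('Nothing')

        # A separate encoder is fitted on each dataset here
        train_home_data[att] = LabelEncoder().fit_transform(train_home_data[att])
        test_home_data[att] = LabelEncoder().fit_transform(test_home_data[att])
    else:
        # Numeric column: fill missing values with 0
        test_home_data[att] = test_home_data[att].fillna(0)
        train_home_data[att] = train_home_data[att].fillna(0)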

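Both the train and the test dataset have an attribute "Condition" that can hold the values Bad, Average and Good.

Suppose LabelEncoder() encodes Bad as 0, Average as 2 and Good as 1 in train_home_data. Will the encoding be the same for test_home_data?

If not, what should I do?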

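You should encode the labels before the split, not after it.

The unique labels (= classes) are sorted alphabetically; see the line

uniques = sorted(set(values))

in the following excerpt from the scikit-learn source (in the version quoted here), which is linked at the top right of the documentation page.

python method: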

def _encode_python(values, uniques=None, encode=False):
    # only used in _encode below, see docstring there for details
    if uniques is None:
        uniques = sorted(set(values))
        uniques = np.array(uniques, dtype=values.dtype)
    if encode:
        table = {val: i for i, val in enumerate(uniques)}
        try:
            encoded = np.array([table[v] for v in values])
        except KeyError as e:
            raise ValueError("y contains previously unseen labels: %s"
                             % str(e))
        return uniques, encoded
    else:
        return uniques
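The effect of this sorting can be checked directly on a fitted encoder. A minimal sketch, assuming made-up values for the "Condition" column; `conditions` and `mapping` are names introduced only for this illustration:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical values for the "Condition" column
conditions = ["Bad", "Average", "Good", "Good", "Bad"]

le = LabelEncoder()
codes = le.fit_transform(conditions)

# classes_ holds the sorted unique labels; a label's code is its index there
print(le.classes_.tolist())   # ['Average', 'Bad', 'Good']
print(codes.tolist())         # [1, 0, 2, 2, 1]

# Explicit label -> code mapping
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
```

So with these three values the encoder would give Average = 0, Bad = 1, Good = 2, regardless of the order in which they appear in the data.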

NumPy arrays are handled the same way as far as the classes are concerned; see

return np.unique(values)

since unique() sorts by default.

numpy method:

def _encode_numpy(values, uniques=None, encode=False, check_unknown=True):
    # only used in _encode below, see docstring there for details
    if uniques is None:
        if encode:
            uniques, encoded = np.unique(values, return_inverse=True)
            return uniques, encoded
        else:
            # unique sorts
            return np.unique(values)
    if encode:
        if check_unknown:
            diff = _encode_check_unknown(values, uniques)
            if diff:
                raise ValueError("y contains previously unseen labels: %s"
                                 % str(diff))
        encoded = np.searchsorted(uniques, values)
        return uniques, encoded
    else:
        return uniques
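The two NumPy calls used in that function can be tried in isolation. A small sketch with made-up arrays (`train_values` and `test_values` are illustrative names, not part of scikit-learn):

```python
import numpy as np

train_values = np.array(["Good", "Bad", "Average", "Bad"])

# Fitting: np.unique returns the sorted classes and, with return_inverse=True,
# the index of each input value within those classes
uniques, encoded = np.unique(train_values, return_inverse=True)
print(uniques.tolist())   # ['Average', 'Bad', 'Good']
print(encoded.tolist())   # [2, 1, 0, 1]

# Transforming: look up new values in the already sorted classes
test_values = np.array(["Bad", "Good"])
test_encoded = np.searchsorted(uniques, test_values)
print(test_encoded.tolist())   # [1, 2]
```

np.searchsorted returns an insertion index even for a value that is not in `uniques`, which is presumably why the quoted code checks for unknown labels before calling it.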

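You can never be sure that the test set and the training set contain exactly the same classes. Either set may simply be missing one of the three classes of the label column "Condition".

If you really want to encode after the train/test split, you need to check that both sets contain the same classes before encoding.

Quoting the script:

Uses pure python method for object dtype, and numpy method for all other dtypes.

python method (object dtype):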

assert sorted(set(train_home_data[att])) == sorted(set(test_home_data[att]))

numpy method (all other dtypes):

assert np.array_equal(np.unique(train_home_data[att]), np.unique(test_home_data[att]))
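The assertions above fail without saying which classes differ. A minimal sketch that reports them instead, using a hypothetical helper `unseen_labels` and made-up data; dropping test rows with unseen labels is only one possible way to handle them:

```python
import pandas as pd

def unseen_labels(train_col: pd.Series, test_col: pd.Series) -> set:
    """Return labels that occur in test_col but not in train_col."""
    return set(test_col.unique()) - set(train_col.unique())

# Hypothetical data: "Excellent" never occurs in the training column
train = pd.DataFrame({"Condition": ["Bad", "Average", "Good"]})
test = pd.DataFrame({"Condition": ["Good", "Excellent", "Bad"]})

missing = unseen_labels(train["Condition"], test["Condition"])
print(missing)   # {'Excellent'}

# One option: keep only test rows whose label was seen during training
test_known = test[test["Condition"].isin(train["Condition"].unique())]
print(test_known["Condition"].tolist())   # ['Good', 'Bad']
```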

I think I got the answer.

Code
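The snippet itself is not preserved at this point. A sketch that would produce the output shown below, assuming the two frames hold the values visible in that output and that each frame is encoded with its own freshly created LabelEncoder, as in the question's code:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Assumed input, taken from the printed output
df1 = pd.DataFrame({"col1": ["A", "B", "C", "D"]})
df2 = pd.DataFrame({"col1": ["D", "A", "A", "B"]})

print(df1["col1"])
print(df2["col1"])

# Assumption: a new encoder is fitted on each frame separately
df1["col1"] = LabelEncoder().fit_transform(df1["col1"])
df2["col1"] = LabelEncoder().fit_transform(df2["col1"])

print(df1["col1"])
print(df2["col1"])
```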

Output

0    A
1    B
2    C
3    D
Name: col1, dtype: object # df1
0    D
1    A
2    A
3    B
Name: col1, dtype: object # df2
0    0
1    1
2    2
3    3
Name: col1, dtype: int64 #df1 encoded
0    2
1    0
2    0
3    1
Name: col1, dtype: int64 #df2 encoded

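B in df1 is encoded as 1

and

B in df2 is also encoded as 1.

So if I encode both the training and the test dataset, the encoded values of the training set are reflected in the test dataset (only if both are label encoded).

I suggest fitting the label encoder on one dataset and using it to transform both: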

import pandas as pd
from sklearn.preprocessing import LabelEncoder

data1 = [('A', 1), ('B', 2), ('C', 3), ('D', 4)]
data2 = [('D', 1), ('A', 2), ('A', 3), ('B', 4)]

df1 = pd.DataFrame(data1, columns=['col1', 'col2'])
df2 = pd.DataFrame(data2, columns=['col1', 'col2'])

# Fit a single encoder on df1, then reuse it (without refitting) on df2
le = LabelEncoder()
df1['col1'] = le.fit_transform(df1['col1'])
df2['col1'] = le.transform(df2['col1'])
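To see what the shared encoder learned, its mapping can be inspected. A short sketch using the same col1 values; `enc1`, `enc2`, `mapping` and `restored` are names introduced only for this illustration:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df1 = pd.DataFrame({"col1": ["A", "B", "C", "D"]})
df2 = pd.DataFrame({"col1": ["D", "A", "A", "B"]})

le = LabelEncoder()
enc1 = le.fit_transform(df1["col1"])   # learns the classes A, B, C, D
enc2 = le.transform(df2["col1"])       # reuses the same mapping

# Label -> code mapping learned from df1
mapping = {label: code for code, label in enumerate(le.classes_.tolist())}
print(mapping)         # {'A': 0, 'B': 1, 'C': 2, 'D': 3}
print(enc2.tolist())   # [3, 0, 0, 1]

# inverse_transform recovers the original labels from the codes
restored = le.inverse_transform(enc2).tolist()
print(restored)        # ['D', 'A', 'A', 'B']
```

With the shared encoder, D is 3 in both frames. In the output of the separately fitted encoders further up, D is 3 in df1 but 2 in df2, because df2 contains no C; only A and B happen to agree there.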

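Does this answer your question? The idea is that you fit the encoder only once, e.g. on the training dataset, and then apply it to the test dataset (without refitting or changing it).

@felice Thanks, I don't get it yet.

The same problem applies: you have to make sure that all possible values of the column are present in both datasets. That is not guaranteed, so you need to check that both datasets have the same unique values to encode; otherwise the encoder may find unknown values in the test set.

You can easily remove all data points from the test set whose labels are not available in the training set, since you will not be able to classify them anyway. That should solve the problem.

The training data and the test data are two different datasets. The attributes of the test dataset are a subset of those of the train dataset.

@ChaitanyaThombare Yes, they are two different datasets. But the test set is usually not a subset of the train set; train and test sets are split from the original dataset. If your test set really is a subset of the train set, the test set is invalid. I don't understand how the attributes could be a subset of the train dataset. If you mean that both datasets are guaranteed to have the same attributes, then you can be sure that the encoding of the labels is the same in the train and test sets.

Sincere apologies for being unclear. In this case the train dataset has 85 attributes and the test dataset has 68 attributes. All attributes in the test dataset exist in the train dataset. Both datasets are provided as two separate csv files.

What do you mean by attributes? Do you mean labels or classes? I guess you mean labels, because you probably don't have 68 classes. Again, it must be guaranteed that the labels in both datasets cover the same classes. If the training set has one more class besides [Average, Bad, Good] = [0, 1, 2], for example "Above Average", the alphabetical order becomes [Above Average, Average, Bad, Good] and the encoding [0, 1, 2, 3] is different. So you have to make sure that both sets contain exactly the same classes, hence the assert statements above.

I see that I misunderstood you here: you meant 68 or 85 attribute columns, not the label column to be encoded. Now I also see your point about the subset. So you test the model with fewer attribute columns than you trained it with? Does that work? The script would usually crash with a message like "size/dimension not the same". If an attribute column in the test set has the same content as the one in the train set, the encoding gives you exactly the same output. Again, that makes no sense for a model.

But you fitted two different label encoders on two different datasets, df1 and df2. How do you make sure the order stays the same? My suggestion: use .fit_transform() on df1 and .transform() on df2.

As @felice asked, I tried fit_transform() on df1 and transform() on df2. I get an error: "This LabelEncoder instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator."

I added a solution that explains what I mean. Hope it helps.

OK, now I see what you mean. My question now is: if I use LabelEncoder() instead of le, it does not work. What is the difference?

With LabelEncoder() you initialize a new label encoder every time. By using the same one, in this case le, you solve the problem. If you think this answer is correct, please mark it as accepted.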
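The difference between calling LabelEncoder() each time and reusing le can be reproduced in a few lines. A minimal sketch with made-up label lists; `error_raised` and `test_codes` are illustrative names:

```python
from sklearn.exceptions import NotFittedError
from sklearn.preprocessing import LabelEncoder

train_labels = ["A", "B", "C", "D"]
test_labels = ["D", "A", "A", "B"]

# Each LabelEncoder() call creates a new, unfitted object, so the fit done
# on the first line is lost by the time transform is called on another one
LabelEncoder().fit_transform(train_labels)
try:
    LabelEncoder().transform(test_labels)
    error_raised = False
except NotFittedError:
    error_raised = True
print(error_raised)   # True

# Reusing one instance keeps the classes learned from the training labels
le = LabelEncoder()
le.fit(train_labels)
test_codes = le.transform(test_labels).tolist()
print(test_codes)     # [3, 0, 0, 1]
```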