How can I make sure data is always encoded to the same numbers when transforming it for training and prediction with a scikit-learn model?


python python-3.x scikit-learn


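I want to make sure that the values in an incoming dataset are encoded the same way as the data the model was trained on. For example, given

df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})

the data should look like this once it has been transformed: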

>>>df
prediction  features
1           1
2           2
3           3

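Now I want to make sure that a new set of data

new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})

is transformed with the same mapping as the original dataframe df. Note that I deliberately added a value to new_df that was not in df ('yellow'), because the model has to cope with that as well. The new dataframe should look like this: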

>>>new_df
prediction  features
4           3
1           2
2           1

How can I achieve this, and how can I reverse the transformation afterwards?

You can use LabelEncoder here:

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})
le = preprocessing.LabelEncoder()
le.fit(df["prediction"])
oldData = df['prediction'].tolist()  # labels seen during training
df["prediction"] = le.transform(df["prediction"])

new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})
newData = new_df['prediction'].tolist()
# Labels that were not present during training, in a reproducible order
newData = sorted(set(newData) - set(oldData))
# Append them to the fitted classes so the existing labels keep their codes
le.classes_ = np.append(le.classes_, newData)
new_df["prediction"] = le.transform(new_df["prediction"])
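As a quick sanity check of the snippet above, here is a minimal sketch that confirms the labels seen during training keep their codes after `classes_` is extended. The variable names (`train_labels`, `codes_before`, and so on) are only illustrative, and note that editing `classes_` by hand is not, as far as I know, a documented use of LabelEncoder, so treat this as an assumption that is worth re-checking on your scikit-learn version.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

train_labels = ["red", "green", "blue"]
new_labels = ["yellow", "red", "green"]

le = LabelEncoder()
le.fit(train_labels)
# Codes assigned to the training labels before the encoder is touched
codes_before = dict(zip(train_labels, le.transform(train_labels)))

# Append labels that were not seen during fit; sorted() keeps the order reproducible
unseen = sorted(set(new_labels) - set(le.classes_))
le.classes_ = np.append(le.classes_, unseen)

# The same training labels must still map to the same codes
codes_after = dict(zip(train_labels, le.transform(train_labels)))
new_codes = le.transform(new_labels)

print(codes_before == codes_after)      # True if nothing was renumbered
print(new_codes)                        # codes for yellow, red, green
print(le.inverse_transform(new_codes))  # back to the original strings
```

If the first line ever prints False, the existing labels have been renumbered and a model trained on the old codes would be fed the wrong values.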

Update

import numpy as np
import pandas as pd
from sklearn import preprocessing

df = pd.DataFrame({'prediction':['red', 'green', 'blue'], 'features': ['one','two','three']})
encoderDict = {}  # one fitted LabelEncoder per column
oldData = {}      # labels seen during training, per column
for col in df.columns:
    le = preprocessing.LabelEncoder()
    le.fit(df[col])
    encoderDict[col] = le
    oldData[col] = df[col].tolist()
    df[col] = le.transform(df[col])

new_df = pd.DataFrame({'prediction':['yellow', 'red', 'green'], 'features': ['three','two','one']})
newData = {}
for col in new_df.columns:
    newData[col] = new_df[col].tolist()
    # Labels in this column that were not present during training
    newData[col] = sorted(set(newData[col]) - set(oldData[col]))
    # Append them so the existing labels keep their codes
    encoderDict[col].classes_ = np.append(encoderDict[col].classes_, newData[col])
    new_df[col] = encoderDict[col].transform(new_df[col])

To reverse the transformation, you only need to do the following:

# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
ndf = pd.concat([df, new_df]).reset_index(drop=True)
for col in ndf:
    print(encoderDict[col].inverse_transform(ndf[col]))
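Since the approach above depends on editing `classes_` by hand, the same idea can also be written with a plain dictionary, which may be easier to reason about. The sketch below is only an illustration: `StableLabelMap` is a hypothetical helper, not part of scikit-learn or pandas. It numbers labels from 1 in the order they are first seen, which happens to reproduce the tables in the question, and it never renumbers a label once it has a code.

```python
class StableLabelMap:
    """Hypothetical helper: assigns integer codes to labels and never renumbers them."""

    def __init__(self):
        self.code_of = {}   # label -> code
        self.label_of = []  # position (code - 1) -> label

    def encode(self, labels):
        codes = []
        for label in labels:
            if label not in self.code_of:
                # Unseen label: give it the next free code, starting at 1
                self.label_of.append(label)
                self.code_of[label] = len(self.label_of)
            codes.append(self.code_of[label])
        return codes

    def decode(self, codes):
        # Reverse lookup; codes are 1-based
        return [self.label_of[code - 1] for code in codes]


# One map per column, mirroring encoderDict in the answer
maps = {"prediction": StableLabelMap(), "features": StableLabelMap()}

train_pred = maps["prediction"].encode(["red", "green", "blue"])    # [1, 2, 3]
train_feat = maps["features"].encode(["one", "two", "three"])       # [1, 2, 3]

new_pred = maps["prediction"].encode(["yellow", "red", "green"])    # [4, 1, 2]
new_feat = maps["features"].encode(["three", "two", "one"])         # [3, 2, 1]

restored = maps["prediction"].decode(new_pred)  # ['yellow', 'red', 'green']
```

The maps would have to be saved together with the model (for example with `pickle`) so that prediction time reuses exactly the codes assigned during training.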

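Great, thank you very much. Could you add the inverse transform to the code as well? I'm going to test it now, and if everything works I'll mark it as the answer.

Update: it doesn't work properly. 'red' and the other values that already existed get new class labels. They have to stay the same... Also, I'm fairly sure the whole dataframe needs to be converted to numbers, otherwise it won't fit the model.

I've made some changes, so it should work now. If you want to run it on the whole dataset, put it in a for loop and change "prediction" to the column name you need.

Can I use df=df.apply(lambda x: le.transform(x)) or something similar? If I can, could you add that to the code as well?

You can, but you shouldn't. Using a single encoder for the complete dataset is not a good idea.

But when training an ML model, everything needs to be numerically encoded. What do you suggest I do? I need to make sure all values stay the same...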
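One option that was not raised in this thread, so take it as a suggestion and not as the answerer's method, is scikit-learn's `OrdinalEncoder`. It encodes a whole dataframe in one call while keeping a separate category list per column, and with `handle_unknown="use_encoded_value"` it maps unseen values to a fixed code and leaves the fitted categories alone. A minimal sketch, assuming scikit-learn 0.24 or newer and using -1 as an arbitrary choice for the unknown code:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train_df = pd.DataFrame({'prediction': ['red', 'green', 'blue'],
                         'features': ['one', 'two', 'three']})
new_df = pd.DataFrame({'prediction': ['yellow', 'red', 'green'],
                       'features': ['three', 'two', 'one']})

# Each column gets its own sorted category list; unseen values become -1
enc = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = enc.fit_transform(train_df)

# transform() reuses the categories learned in fit, so known values keep their codes
new_codes = enc.transform(new_df)

print(enc.categories_)  # per-column category lists learned from train_df
print(train_codes)
print(new_codes)        # 'yellow' shows up as -1 in the first column

# Reverse the encoding of the training data
restored = enc.inverse_transform(train_codes)
```

Unlike the `classes_` approach, every unseen value collapses to the same code here, so 'yellow' cannot be recovered by `inverse_transform`. Whether that is acceptable depends on whether the model needs to tell new categories apart.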