Python: How to one-hot encode several columns of a Pandas DataFrame for later use with Scikit-learn
Suppose I have the following data:
import pandas as pd
data = {
'Reference': [1, 2, 3, 4, 5],
'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
'Mileage': [35000, 45000, 121000, 35000, 181000],
'Year': [2015, 2014, 2012, 2016, 2013]
}
df = pd.DataFrame(data)
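Based on this, I want to one-hot encode the two columns "Brand" and "Town" in order to train a classifier (say, with Scikit-learn) and predict the year.
Once the classifier is trained, I will want to predict the year for new input data (not used in training), and I will need to re-apply the same one-hot encoding. For example: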
new_data = {
'Reference': [6, 7],
'Brand': ['Volvo', 'Audi'],
'Town': ['Stockholm', 'Munich']
}
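Given that several columns need to be encoded, and that the same encoding must be applicable to new data later on, what is the best way to one-hot encode these 2 columns of a Pandas DataFrame?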
Here is a follow-up point to consider.
Demo:
You can use the get_dummies function provided by pandas to convert the categorical values, like this:
import pandas as pd
data = {
'Reference': [1, 2, 3, 4, 5],
'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
'Mileage': [35000, 45000, 121000, 35000, 181000],
'Year': [2015, 2014, 2012, 2016, 2013]
}
df = pd.DataFrame(data)
train = pd.concat([df.get(['Mileage', 'Reference', 'Year']),
                   pd.get_dummies(df['Brand'], prefix='Brand'),
                   pd.get_dummies(df['Town'], prefix='Town')], axis=1)
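As a shorter alternative to concatenating the pieces by hand, get_dummies also accepts a columns argument that encodes several columns in one call. A minimal sketch, assuming the same data as above (the name train_alt is only illustrative and not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013],
})

# Encode both categorical columns at once; each column name becomes the prefix.
# The untouched columns (Reference, Mileage, Year) are kept as they are.
train_alt = pd.get_dummies(df, columns=['Brand', 'Town'])
print(train_alt.columns.tolist())
```

This should yield the same set of columns as the concat version, only in a different order.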
For the test data, you can do:
new_data = {
'Reference': [6, 7],
'Brand': ['Volvo', 'Audi'],
'Town': ['Stockholm', 'Munich']
}
test = pd.DataFrame(new_data)
test = pd.concat([test.get(['Reference']),
                  pd.get_dummies(test['Brand'], prefix='Brand'),
                  pd.get_dummies(test['Town'], prefix='Town')], axis=1)
# Get the columns that are in the training set but missing from the test set
missing_cols = set(train.columns) - set(test.columns)
# Add each missing column to the test set with a default value of 0
# (this includes 'Mileage' and 'Year', which new_data does not provide)
for c in missing_cols:
    test[c] = 0
# Ensure the test set columns are in the same order as in the training set
test = test[train.columns]
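The loop and the final column selection can also be written as a single DataFrame.reindex call. This is only a sketch, under the assumption that 'Year' is the target and should stay out of the feature matrix; the names train_enc, test_enc, feature_cols and test_aligned are illustrative and do not come from the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    'Reference': [1, 2, 3, 4, 5],
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
    'Mileage': [35000, 45000, 121000, 35000, 181000],
    'Year': [2015, 2014, 2012, 2016, 2013],
})
new_df = pd.DataFrame({
    'Reference': [6, 7],
    'Brand': ['Volvo', 'Audi'],
    'Town': ['Stockholm', 'Munich'],
})

train_enc = pd.get_dummies(df, columns=['Brand', 'Town'])
test_enc = pd.get_dummies(new_df, columns=['Brand', 'Town'])

# Feature columns seen during training (assumption: 'Year' is the target).
feature_cols = train_enc.drop(columns='Year').columns

# reindex adds training columns missing from the test set (filled with 0),
# drops test columns unknown to the training set, and fixes the column order.
# 'Mileage' is absent from new_df, so it is filled with 0 here as well.
test_aligned = test_enc.reindex(columns=feature_cols, fill_value=0)
```

test_aligned can then be passed to the trained model's predict method, since its columns match the training features one to one.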
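What if the test set contains a new, unseen value in one of the one-hot encoded columns? With this approach, would those values be kept or dropped? Sorry, I'm asking because I couldn't follow the last line.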
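To illustrate the question: selecting the training columns keeps only columns that existed at training time, so a dummy column created from an unseen value is discarded and that row ends up with all zeros for the feature. Scikit-learn's OneHotEncoder with handle_unknown='ignore' behaves the same way while remembering the fitted categories. A hedged sketch follows; 'Toyota' is a made-up brand used only to trigger the unseen-value case, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train_raw = pd.DataFrame({
    'Brand': ['Volkswagen', 'Volvo', 'Volvo', 'Audi', 'Volkswagen'],
    'Town': ['Berlin', 'Berlin', 'Stockholm', 'Munich', 'Berlin'],
})
# 'Toyota' is a hypothetical brand that never appears in the training data.
test_raw = pd.DataFrame({
    'Brand': ['Volvo', 'Toyota'],
    'Town': ['Stockholm', 'Munich'],
})

# pandas route: get_dummies creates a Brand_Toyota column for the test set,
# but aligning to the training columns drops it again.
train_cols = pd.get_dummies(train_raw, columns=['Brand', 'Town']).columns
test_enc = pd.get_dummies(test_raw, columns=['Brand', 'Town'])
test_enc = test_enc.reindex(columns=train_cols, fill_value=0)
brand_cols = [c for c in train_cols if c.startswith('Brand_')]
toyota_row = test_enc[brand_cols].astype(int).iloc[1].tolist()

# scikit-learn route: fit once on the training data, reuse on new data.
# handle_unknown='ignore' encodes unseen categories as an all-zero block.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train_raw[['Brand', 'Town']])
encoded = encoder.transform(test_raw[['Brand', 'Town']]).toarray()
```

In both routes the second test row carries no brand information at all, which may or may not be acceptable depending on the model.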