Python: how to use a unified pipeline for numeric and categorical features in machine learning?
I want to run an encoder on the categorical features and an imputer on the numeric features (see below), and then unify them all together.
For example, numeric data together with a categorical feature:
import pandas as pd
df_with_cat = pd.DataFrame({
    'A' : ['ios', 'android', 'web', 'NaN'],
    'B' : [4, 4, 'NaN', 2],
    'target' : [1, 1, 0, 0]
})
df_with_cat.head()
A B target
----------------------
0 ios 4 1
1 android 4 1
2 web NaN 0
3 NaN 2 0
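We want to run an imputer on the numeric features, i.e. replace missing values/NaN with the "most_frequent"/"median"/"mean" value ==> pipeline 1. But we want to transform the categorical features into numbers, e.g. with OneHotEncoding ==> pipeline 2.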
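The two steps described above can be sketched separately before any unification. This is only a minimal illustration, assuming scikit-learn 0.20 or later (for `SimpleImputer` and a string-capable `OneHotEncoder`) and assuming that the string 'NaN' in column B stands for a missing value; the variable names are made up for the example.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    "A": ["ios", "android", "web", "NaN"],
    "B": [4, 4, "NaN", 2],
    "target": [1, 1, 0, 0],
})

# Assumption: the string "NaN" in B marks a missing value,
# so coerce the column to floats containing a real NaN.
numeric = pd.to_numeric(df["B"], errors="coerce").to_frame()

# Pipeline 1: fill missing numeric values with the most frequent value.
imputed = SimpleImputer(strategy="most_frequent").fit_transform(numeric)

# Pipeline 2: one-hot encode the categorical column.
# The string "NaN" in A is treated as an ordinary category here.
encoder = OneHotEncoder()
encoded = encoder.fit_transform(df[["A"]]).toarray()

print(imputed.ravel())         # imputed values of B
print(encoder.categories_[0])  # categories found in A
print(encoded.shape)           # one column per category
```

Run on their own, the two steps return two separate arrays, which is exactly why some way of unifying them is needed.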
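What is the best practice for unifying them? P.S.: how do I unify the two pipelines above with a classifier (random forest / decision tree / GBM)?
Apparently there is a cool way to do it! For this df: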
df_with_cat = pd.DataFrame({
    'A' : ['ios', 'android', 'web', 'NaN'],
    'B' : [4, 4, 'NaN', 2],
    'target' : [1, 1, 0, 0]
})
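If you don't mind upgrading sklearn to 0.20.2, run:
pip3 install scikit-learn==0.20.2
and use the ColumnTransformer solution (as suggested by @AI_learning; its code appears further down).
Then:
columnTransformer.fit(df_with_cat)
But if you are using an earlier sklearn version, use this: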
from sklearn.pipeline import Pipeline
# Imputer only exists in older scikit-learn releases (it was removed in 0.22)
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelBinarizer, LabelEncoder
CATEGORICAL_FEATURES = ['A']
NUMERICAL_FEATURES = ['B']
TARGET = ['target']
numerical_pipline = Pipeline([
    ('selector', DataFrameSelector(NUMERICAL_FEATURES)),
    ('imputer', Imputer(strategy='most_frequent'))
])
categorical_pipeline = Pipeline([
    ('selector', DataFrameSelector(CATEGORICAL_FEATURES)),
    ('cat_encoder', LabelBinarizerPipelineFriendly())
])
As you may have noticed, DataFrameSelector is missing. It is not part of sklearn, so let's write it here:
from sklearn.base import BaseEstimator, TransformerMixin
class DataFrameSelector(BaseEstimator, TransformerMixin):
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
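To make the role of the selector concrete, here is a small usage sketch. It repeats the class so that the snippet runs on its own, and the variable name `selected` is only illustrative.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Pick a subset of DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self  # nothing to learn, the column list is fixed

    def transform(self, X):
        return X[self.attribute_names].values


df = pd.DataFrame({
    "A": ["ios", "android", "web", "NaN"],
    "B": [4, 4, "NaN", 2],
    "target": [1, 1, 0, 0],
})

# fit_transform comes from TransformerMixin and calls fit() then transform().
selected = DataFrameSelector(["A"]).fit_transform(df)
print(selected.shape)  # (4, 1)
```

Because the selector hands back a plain 2-D array, each downstream step in a sub-pipeline only ever sees the columns it is meant to handle.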
Let's unify them:
from sklearn.pipeline import FeatureUnion, make_pipeline
preprocessing_pipeline = FeatureUnion(transformer_list=[
    ('numerical_pipline', numerical_pipline),
    ('categorical_pipeline', categorical_pipeline)
])
That's it, now let's run:
preprocessing_pipeline.fit_transform(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES])
Now let's get crazier!
Unify them with a classifier pipeline:
from sklearn import tree
clf = tree.DecisionTreeClassifier()
full_pipeline = make_pipeline(preprocessing_pipeline, clf)
Train them together:
full_pipeline.fit(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES], df_with_cat[TARGET])
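The snippets above target older scikit-learn releases, where `Imputer` is still available. As a rough sketch of the same FeatureUnion layout on a recent release, the version below swaps in `SimpleImputer` and `OneHotEncoder`. That swap is an assumption of this example and not part of the original answer, as is the conversion of the 'NaN' string in column B into a real missing value.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier


class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Pick a subset of DataFrame columns and return them as a NumPy array."""

    def __init__(self, attribute_names):
        self.attribute_names = attribute_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.attribute_names].values


df = pd.DataFrame({
    "A": ["ios", "android", "web", "NaN"],
    "B": [4, 4, "NaN", 2],
    "target": [1, 1, 0, 0],
})
# Assumption: "NaN" in B marks a missing value; turn it into a real NaN.
df["B"] = pd.to_numeric(df["B"], errors="coerce")

numerical_pipeline = Pipeline([
    ("selector", DataFrameSelector(["B"])),
    ("imputer", SimpleImputer(strategy="most_frequent")),
])
categorical_pipeline = Pipeline([
    ("selector", DataFrameSelector(["A"])),
    # handle_unknown="ignore" encodes unseen categories as all zeros
    ("cat_encoder", OneHotEncoder(handle_unknown="ignore")),
])
preprocessing = FeatureUnion([
    ("numerical", numerical_pipeline),
    ("categorical", categorical_pipeline),
])

# Preprocessing and classifier are fitted together in one call.
full_pipeline = make_pipeline(preprocessing, DecisionTreeClassifier(random_state=0))
full_pipeline.fit(df[["A", "B"]], df["target"])
predictions = full_pipeline.predict(df[["A", "B"]])
print(predictions)
```

Fitting everything through one pipeline object means the imputation and encoding learned on the training data are reused unchanged at prediction time.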
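Just open a Jupyter notebook, grab the code snippets, and try them yourself.
The definition of LabelBinarizerPipelineFriendly() is given further down.
The main advantage of this approach is that you can dump the trained model together with the whole pipeline to a pkl file and then use the same model in real time (prediction in production).
As mentioned by @Sergey Bushmanov, ColumnTransformer can be used to achieve the same result: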
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({
    'A' : ['ios', 'android', 'web', 'NaN'],
    'B' : [4, 4, 'NaN', 2],
    'target' : [1, 1, 0, 0]
})
categorical_features = ['A']
numeric_features = ['B']
TARGET = ['target']
# Turn the 'NaN' strings in the numeric column into real missing values
df[numeric_features] = df[numeric_features].replace('NaN', np.nan)
columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features),
        ('num', SimpleImputer(strategy='most_frequent'), numeric_features)])
columnTransformer.fit_transform(df)
# Output:
array([[0., 0., 1., 0., 4.],
       [0., 1., 0., 0., 4.],
       [0., 0., 0., 1., 4.],
       [1., 0., 0., 0., 2.]])
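The question also asks how to attach a classifier, and the other answer mentions dumping the trained pipeline to a pkl file. A minimal sketch of both with ColumnTransformer follows. The names `model`, `restored` and `new_rows` are hypothetical, the decision tree is just one of the classifiers the question lists, and the pickle round trip is done in memory rather than through a file.

```python
import pickle

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "A": ["ios", "android", "web", "NaN"],
    "B": [4, 4, "NaN", 2],
    "target": [1, 1, 0, 0],
})
# Assumption: "NaN" in B marks a missing value; turn it into a real NaN.
df["B"] = pd.to_numeric(df["B"], errors="coerce")

preprocessor = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["A"]),
    ("num", SimpleImputer(strategy="most_frequent"), ["B"]),
])

# Preprocessing and classifier live in a single estimator.
model = make_pipeline(preprocessor, DecisionTreeClassifier(random_state=0))
model.fit(df[["A", "B"]], df["target"])

# Serialize and restore the fitted pipeline in memory;
# pickle.dump(model, open_file) would write it to a .pkl file instead.
restored = pickle.loads(pickle.dumps(model))

# The restored pipeline imputes and encodes new rows exactly as during training.
new_rows = pd.DataFrame({"A": ["ios", "web"], "B": [4.0, float("nan")]})
restored_predictions = restored.predict(new_rows)
print(restored_predictions)
```

Since the preprocessing travels with the classifier, the production side only needs to load one object and call `predict` on raw rows.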
Using ColumnTransformer is the easier way. For the FeatureUnion approach, LabelBinarizerPipelineFriendly() needs to be defined:
class LabelBinarizerPipelineFriendly(LabelBinarizer):
    '''
    Wrapper to LabelBinarizer to allow usage in sklearn.pipeline
    '''
    def fit(self, X, y=None):
        """this would allow us to fit the model based on the X input."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self  # estimators are expected to return self from fit()
    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)
    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
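A short sketch of why the wrapper is needed: a Pipeline passes both X and y to its steps, while the stock `LabelBinarizer.fit_transform` accepts only the labels. The snippet repeats the class (using the Python 3 `super()` form) so it runs on its own, and the flag `plain_accepts_xy` is only there for illustration.

```python
from sklearn.preprocessing import LabelBinarizer


class LabelBinarizerPipelineFriendly(LabelBinarizer):
    """LabelBinarizer whose methods accept the (X, y) arguments a Pipeline passes."""

    def fit(self, X, y=None):
        super().fit(X)
        return self

    def transform(self, X, y=None):
        return super().transform(X)

    def fit_transform(self, X, y=None):
        return super().fit(X).transform(X)


X = ["ios", "android", "web", "NaN"]
y = [1, 1, 0, 0]

# The stock LabelBinarizer takes only the labels, so the extra argument fails.
try:
    LabelBinarizer().fit_transform(X, y)
    plain_accepts_xy = True
except TypeError:
    plain_accepts_xy = False

# The wrapper ignores y and binarizes X, one column per distinct value.
encoded = LabelBinarizerPipelineFriendly().fit_transform(X, y)
print(plain_accepts_xy)
print(encoded.shape)
```

On scikit-learn 0.20 and later, `OneHotEncoder` handles string columns directly, so this wrapper is mainly useful on the older releases the FeatureUnion approach targets.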