How can I use a unified pipeline for numerical and categorical features in machine learning with Python?

Tags: python, machine-learning, scikit-learn

I want to run an encoder on the categorical features and an imputer on the numerical features (see below), and unify them together.
For example, a DataFrame with both numerical and categorical features:

df_with_cat = pd.DataFrame({
           'A'      : ['ios', 'android', 'web', 'NaN'],
           'B'      : [4, 4, 'NaN', 2], 
           'target' : [1, 1, 0, 0] 
       })
df_with_cat.head()

    A        B  target
----------------------
0   ios      4    1
1   android  4    1
2   web     NaN   0
3   NaN      2    0
We want to run an imputer on the numerical features, i.e. replace the missing values / NaNs with the 'most_frequent' / 'median' / 'mean' value => pipeline 1. And we want to convert the categorical features to numerical ones (OneHotEncoding etc.) => pipeline 2.

What is the best practice for unifying them?

P.S.: and unify the 2 above with a classifier... (Random Forest / Decision Tree / GBM)

Obviously there's a cool way to do it! For this df:

df_with_cat = pd.DataFrame({
           'A'      : ['ios', 'android', 'web', 'NaN'],
           'B'      : [4, 4, 'NaN', 2], 
           'target' : [1, 1, 0, 0] 
       })
If you don't mind upgrading your sklearn to 0.20.2, run:

pip3 install scikit-learn==0.20.2
and use the ColumnTransformer solution suggested by @AI_learning (sketched right below; the full snippet appears at the end of this post):
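A minimal sketch of that setup, reproduced here so the fit call below has something to run against (it mirrors the full snippet at the end of this post; as there, the 'NaN' strings in column 'B' are first turned into real NaNs so the imputer treats them as missing):

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df_with_cat['B'] = df_with_cat['B'].replace('NaN', np.nan)

columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), ['A']),                          # one-hot encode the categorical column
        ('num', SimpleImputer(strategy='most_frequent'), ['B'])]) # impute missing numerical values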

And then:

columnTransformer.fit(df_with_cat)

But if you're working with an earlier sklearn version, use this one:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Imputer
from sklearn.preprocessing import LabelBinarizer, LabelEncoder 

CATEGORICAL_FEATURES = ['A']
NUMERICAL_FEATURES = ['B']
TARGET = ['target']

numerical_pipeline = Pipeline([
    ('selector', DataFrameSelector(NUMERICAL_FEATURES)),  # pick the numerical columns
    ('imputer', Imputer(strategy='most_frequent'))        # fill missing values with the most frequent one
])

categorical_pipeline = Pipeline([
    ('selector', DataFrameSelector(CATEGORICAL_FEATURES)),  # pick the categorical columns
    ('cat_encoder', LabelBinarizerPipelineFriendly())       # one-hot encode them (definition further down)
])
If you noticed, we are missing DataFrameSelector; it's not part of sklearn, so let's write it here:

from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameSelector(BaseEstimator, TransformerMixin):
    """Select a subset of DataFrame columns and return them as a NumPy array."""
    def __init__(self, attribute_names):
        self.attribute_names = attribute_names
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X[self.attribute_names].values
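For example, the selector simply pulls the requested columns out of the DataFrame and hands them on as a plain NumPy array:

DataFrameSelector(NUMERICAL_FEATURES).fit_transform(df_with_cat)  # the 'B' column as a (4, 1) array, ready for the imputer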
Now let's unify them:

from sklearn.pipeline import FeatureUnion, make_pipeline

preprocessing_pipeline = FeatureUnion(transformer_list=[
    ('numerical_pipeline', numerical_pipeline),
    ('categorical_pipeline', categorical_pipeline)
])
That's it, now let's run:

preprocessing_pipeline.fit_transform(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES])
Now let's go even crazier! Let's unify them with a classifier pipeline:

from sklearn import tree
clf = tree.DecisionTreeClassifier()
full_pipeline = make_pipeline(preprocessing_pipeline, clf)
And train them together:

full_pipeline.fit(df_with_cat[CATEGORICAL_FEATURES+NUMERICAL_FEATURES], df_with_cat[TARGET])
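Once fitted, the very same pipeline takes care of the preprocessing at prediction time as well; a minimal sketch with a couple of hypothetical new rows (new_df is not part of the original post):

new_df = pd.DataFrame({'A': ['ios', 'web'], 'B': [3, 5]})
full_pipeline.predict(new_df[CATEGORICAL_FEATURES + NUMERICAL_FEATURES])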
Just open a Jupyter notebook, grab the snippets and try it yourself!

Here is the definition of LabelBinarizerPipelineFriendly():

class LabelBinarizerPipelineFriendly(LabelBinarizer):
    '''
     Wrapper to LabelBinarizer to allow usage in sklearn.pipeline
    '''

    def fit(self, X, y=None):
        """this would allow us to fit the model based on the X input."""
        super(LabelBinarizerPipelineFriendly, self).fit(X)
        return self  # return self so the fitted transformer can be chained

    def transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).transform(X)

    def fit_transform(self, X, y=None):
        return super(LabelBinarizerPipelineFriendly, self).fit(X).transform(X)
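The wrapper is needed because LabelBinarizer.fit_transform() only accepts the labels themselves (no separate y argument), which breaks when a Pipeline calls fit_transform(X, y); wrapped like this, it behaves as a regular transformer:

LabelBinarizerPipelineFriendly().fit_transform(df_with_cat['A'])  # 4 one-hot columns, one per category in 'A'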
The major advantage of this approach is that you can dump the trained model, with the whole pipeline, to a pkl file and then use the very same model in real time (prediction in production).
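A minimal sketch of that dump/load round trip with joblib (the file name is arbitrary; on older sklearn versions joblib is available as sklearn.externals.joblib):

import joblib

# persist the fitted preprocessing + classifier pipeline
joblib.dump(full_pipeline, 'full_pipeline.pkl')

# later, e.g. in the production service, load it back and predict with the very same object
loaded_pipeline = joblib.load('full_pipeline.pkl')
loaded_pipeline.predict(df_with_cat[CATEGORICAL_FEATURES + NUMERICAL_FEATURES])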

As @Sergey Bushmanov mentioned, ColumnTransformer can be used to implement the same:

import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
           'A'      : ['ios', 'android', 'web', 'NaN'],
           'B'      : [4, 4, 'NaN', 2], 
           'target' : [1, 1, 0, 0] 
       })

categorical_features = ['A']
numeric_features = ['B']
TARGET = ['target']

# turn the 'NaN' strings in the numeric column into real NaNs so the imputer sees them as missing
df[numeric_features] = df[numeric_features].replace('NaN', np.NaN)

columnTransformer = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(), categorical_features),
        ('num', SimpleImputer(strategy='most_frequent'), numeric_features)])

columnTransformer.fit_transform(df)

# output:
array([[0., 0., 1., 0., 4.],
       [0., 1., 0., 0., 4.],
       [0., 0., 0., 1., 4.],
       [1., 0., 0., 0., 2.]])
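As with the FeatureUnion version above, this ColumnTransformer can be chained with a classifier in a Pipeline; a minimal sketch (the DecisionTreeClassifier is just an arbitrary choice):

from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

clf_pipeline = make_pipeline(columnTransformer, DecisionTreeClassifier())
clf_pipeline.fit(df[categorical_features + numeric_features], df[TARGET])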

With ColumnTransformer there is a much simpler way: no LabelBinarizerPipelineFriendly() needs to be defined at all!