Python: How to standardize only the numeric variables in a sklearn pipeline?

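I am trying to create a sklearn pipeline with two steps:

  • Standardize the data
  • Fit the data using KNN

However, my data has both numeric and categorical variables, and I have already converted the categorical ones to dummy variables with pd.get_dummies. I want to standardize the numeric variables but leave the dummies as they are. This is what I have been doing so far: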

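    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    # X: DataFrame containing both numeric and categorical (dummy) columns
    numeric = [...]      # list of numeric column names
    categorical = [...]  # list of categorical column names
    scaler = StandardScaler()
    # keep X's index so that the index-based merge below lines the rows up
    X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric, index=X.index)
    X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)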
    
However, if I create a pipeline such as:

    pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())
    

it will standardize every column in my DataFrame. Is there a way to do this while standardizing only the numeric columns?

UPD: 2021-05-10

For sklearn >= 0.20 we can use ColumnTransformer (sklearn.compose.ColumnTransformer).

Here is an example:

Imports and data loading

    # Author: Pedro Morales <part.morales@gmail.com>
    #
    # License: BSD 3 clause
    
    import numpy as np
    
    from sklearn.compose import ColumnTransformer
    from sklearn.datasets import fetch_openml
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import StandardScaler, OneHotEncoder
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split, GridSearchCV
    
    np.random.seed(0)
    
    # Load data from https://www.openml.org/d/40945
    X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
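
The classification step further down uses a `preprocessor` object whose definition does not appear in this answer as shown. What follows is only a minimal sketch of what such a preprocessor could look like. It assumes a small made-up DataFrame instead of the downloaded Titanic data, and the column names (`age`, `fare`, `sex`, `embarked`), the toy values, and the choice of a median imputer are illustrative assumptions rather than the author's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-in for the real data (values are made up for illustration).
X_toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "S"],
})

numeric_features = ["age", "fare"]          # assumed numeric columns
categorical_features = ["sex", "embarked"]  # assumed categorical columns

# Numeric columns: fill missing values, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring levels unseen during fit.
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# Each transformer only ever sees the columns listed for it.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

X_toy_prepared = preprocessor.fit_transform(X_toy)
```

The point of the design is that the scaler never touches the categorical columns: the `ColumnTransformer` routes each list of columns to its own transformer and concatenates the results.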
    
Classification

    # Append classifier to preprocessing pipeline.
    # Now we have a full prediction pipeline.
    clf = Pipeline(steps=[('preprocessor', preprocessor),
                          ('classifier', LogisticRegression())])
    
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                        random_state=0)
    
    clf.fit(X_train, y_train)
    print("model score: %.3f" % clf.score(X_test, y_test))
    

Old answer:

Suppose you have the following DataFrame:

    In [163]: df
    Out[163]:
         a     b    c    d
    0  aaa  1.01  xxx  111
    1  bbb  2.02  yyy  222
    2  ccc  3.03  zzz  333
    
    In [164]: df.dtypes
    Out[164]:
    a     object
    b    float64
    c     object
    d      int64
    dtype: object
    
You can find all the numeric columns:

    In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
    
    In [166]: num_cols
    Out[166]: Index(['b', 'd'], dtype='object')
    
    In [167]: df[num_cols]
    Out[167]:
          b    d
    0  1.01  111
    1  2.02  222
    2  3.03  333
    
and apply StandardScaler only to those numeric columns:

    In [168]: scaler = StandardScaler()
    
    In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])
    
    In [170]: df
    Out[170]:
         a         b    c         d
    0  aaa -1.224745  xxx -1.224745
    1  bbb  0.000000  yyy  0.000000
    2  ccc  1.224745  zzz  1.224745
    
Now you can one-hot encode the categorical (non-numeric) columns…
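
The answer stops short of showing the encoding step. One possible way to do it, sketched here with `pd.get_dummies` (the choice of `pd.get_dummies` over `OneHotEncoder` and the variable names `cat_cols` and `df_encoded` are assumptions made for illustration), on the same example DataFrame:

```python
import pandas as pd

# Same example DataFrame as in the answer above.
df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# Every column that is not numeric is treated as categorical here.
cat_cols = list(df.select_dtypes(exclude="number").columns)

# Each categorical column is replaced by one indicator column per level;
# the numeric columns are left untouched.
df_encoded = pd.get_dummies(df, columns=cat_cols)
```

Note that this encodes outside of any pipeline, so the same set of dummy columns has to be reproduced by hand for new data.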

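I would use a FeatureUnion. I then usually do something like this, assuming you also dummy-encode your categorical variables inside the pipeline rather than beforehand with Pandas: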

    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    class Columns(BaseEstimator, TransformerMixin):
        """Select a subset of DataFrame columns by name."""
        def __init__(self, names=None):
            self.names = names

        def fit(self, X, y=None, **fit_params):
            return self  # stateless: nothing to learn

        def transform(self, X):
            return X[self.names]

    numeric = [...]      # list of numeric column names
    categorical = [...]  # list of categorical column names

    pipe = Pipeline([
        ("features", FeatureUnion([
            ('numeric', make_pipeline(Columns(names=numeric), StandardScaler())),
            # `sparse_output` replaces the older `sparse` argument (scikit-learn >= 1.2)
            ('categorical', make_pipeline(Columns(names=categorical), OneHotEncoder(sparse_output=False)))
        ])),
        ('model', KNeighborsClassifier())
    ])
    

You can also take a look at [link missing], which is interesting as well.

Since you have already converted your categorical features to dummies with pd.get_dummies, you do not need OneHotEncoder. Your pipeline should therefore be:

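    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import Pipeline, FeatureUnion
    from sklearn.preprocessing import StandardScaler, FunctionTransformer

    # cat_indices / num_indices: integer positions of the dummy and numeric
    # columns; the lambdas use NumPy-style indexing, so pass X as an array.
    knn = KNeighborsClassifier()

    pipeline = Pipeline(steps=[
        ('feature_processing', FeatureUnion(transformer_list=[
            # dummy columns are passed through unchanged
            ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),
            # numeric columns are selected, then standardized
            ('numeric', Pipeline(steps=[
                ('select', FunctionTransformer(lambda data: data[:, num_indices])),
                ('scale', StandardScaler())
            ]))
        ])),
        ('clf', knn)
    ])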
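
The pipeline above never defines `cat_indices` and `num_indices`. One possible way to derive them is sketched below, assuming the dummy-encoded data starts out as a DataFrame and is handed to the pipeline as a NumPy array. The toy data, the column names, and the helper functions `select_categorical` / `select_numeric` (used in place of the lambdas so the pipeline can be pickled) are all made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Made-up data: two numeric columns and one categorical column.
raw = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "weight": [50.0, 60.0, 65.0, 80.0, 90.0, 100.0],
    "color": ["red", "blue", "red", "blue", "red", "blue"],
})
y = np.array([0, 0, 0, 1, 1, 1])

numeric = ["height", "weight"]
X_df = pd.get_dummies(raw, columns=["color"], dtype=float)

# Integer positions of the numeric columns; everything else is a dummy column.
num_indices = [X_df.columns.get_loc(c) for c in numeric]
cat_indices = [i for i in range(X_df.shape[1]) if i not in num_indices]

def select_categorical(data):
    return data[:, cat_indices]

def select_numeric(data):
    return data[:, num_indices]

pipeline = Pipeline(steps=[
    ("feature_processing", FeatureUnion(transformer_list=[
        ("categorical", FunctionTransformer(select_categorical)),
        ("numeric", Pipeline(steps=[
            ("select", FunctionTransformer(select_numeric)),
            ("scale", StandardScaler()),
        ])),
    ])),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])

X = X_df.to_numpy()  # positional indexing needs an array, not a DataFrame
pipeline.fit(X, y)

# Inspect what the classifier actually receives: dummies first, scaled numerics last.
features = pipeline.named_steps["feature_processing"].transform(X)
```

Because the `FeatureUnion` lists the categorical branch first, the dummy columns come first in the combined feature matrix and the scaled numeric columns follow.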
    

Another way to do this is:

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler

    scaler = StandardScaler()
    df = pd.DataFrame()
    df['col1'] = np.random.randint(1, 20, 10)
    df['col2'] = np.random.randn(10)
    df['col3'] = list(5*'Y' + 5*'N')
    # pick the numeric columns and scale only those
    numeric_cols = df.select_dtypes(include='number').columns
    df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
    

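Comments:

Possible duplicate of [link missing]

Hey Marcus, thanks for your post. So how would you use this "pipe" on training and test data? pipe.fit(X_train, y_train)? But in that case the encoder's fit_transform step would be skipped. And if I use fit_transform, the model fitting part would be skipped.

You can use it like any estimator and first call pipe.fit(X_train, y_train). That calls fit_transform() on every TransformerMixin step and then fit() on the last step, which is the estimator. If you then use it for prediction, it also applies all the transformations. This also works automatically in the [link missing] class.

Thanks Marcus, that was helpful.

@NaveenKumar I did not fully understand your question. Currently, the output of Columns(names=numeric) is passed to StandardScaler(), then concatenated with the output of OneHotEncoder() and passed to KNeighborsClassifier().

@Navenkumar take a look at ColumnTransformer, e.g. [link missing]. For ColumnTransformer, note the remainder parameter, which you can use to handle the columns you did not name (drop them or pass them through).

This solution does not involve a pipeline.

@Nocibanbi, you are right, thank you for your comment! I have updated my answer, so now it shows how to use the relatively new ColumnTransformer class in a pipeline ;)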
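
To tie the comments together, here is a hedged sketch of how the original question could be handled with `ColumnTransformer` and its `remainder` parameter: the numeric columns are standardized and the pre-made dummy columns are passed through untouched. The toy data, column names, and the train/test split are assumptions for illustration only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data that has already been dummy-encoded (as with pd.get_dummies).
X = pd.DataFrame({
    "height": [150.0, 160.0, 170.0, 180.0, 190.0, 200.0],
    "weight": [50.0, 60.0, 65.0, 80.0, 90.0, 100.0],
    "color_red": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],
    "color_blue": [0.0, 1.0, 0.0, 1.0, 0.0, 1.0],
})
y = np.array([0, 0, 0, 1, 1, 1])
numeric = ["height", "weight"]

# Scale only the listed columns; remainder="passthrough" keeps all
# other columns unchanged instead of dropping them (the default).
scale_numeric_only = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric)],
    remainder="passthrough",
)
pipe = make_pipeline(scale_numeric_only, KNeighborsClassifier(n_neighbors=3))

X_train, X_test = X.iloc[:5], X.iloc[5:]
y_train, y_test = y[:5], y[5:]

# fit() runs fit_transform on the ColumnTransformer, then fit on the classifier.
pipe.fit(X_train, y_train)

# predict() reuses the scaling learned on the training data before classifying.
predictions = pipe.predict(X_test)

# The fitted transformer can be inspected on its own.
transformed_train = pipe.steps[0][1].transform(X_train)
```

Calling `pipe.fit` and `pipe.predict` this way means the scaler's mean and variance come from the training rows only, which is the behaviour described in the comment about `fit_transform()` and `fit()` above.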