Python: How to standardize only the numeric variables in an sklearn pipeline?


python scikit-learn


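I am trying to create an sklearn pipeline with two steps:

1. Standardize the data
2. Fit the data using KNN

However, my data has both numeric and categorical variables, and I have already converted the categorical ones to dummy variables using pd.get_dummies. I want to standardize the numeric variables but leave the dummies as they are. I have been doing it like this: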

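import pandas as pd
from sklearn.preprocessing import StandardScaler

# X: DataFrame containing both numeric and categorical (dummy) columns
numeric = [...]      # list of numeric column names
categorical = [...]  # list of categorical column names
scaler = StandardScaler()
# index=X.index keeps the row labels so the index-based merge below lines up
X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric, index=X.index)
X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)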

However, if I create a pipeline such as:

pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())

it will standardize every column in my DataFrame. Is there a way to do this while standardizing only the numeric columns?

UPD:2021-05-10

For sklearn >= 0.20 we can use ColumnTransformer (from sklearn.compose).

Here is an example:

Imports and data loading

# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause

import numpy as np

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV

np.random.seed(0)

# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
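The classification step that follows refers to a preprocessor object that is not defined anywhere in the text as preserved here. Below is a minimal sketch of what such an object could look like, not necessarily what the answer originally used. The column split (age and fare as numeric, embarked and sex as categorical), the imputation strategies, and the small hand-made frame named toy are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Assumed column split for the Titanic data; adjust to your own frame.
numeric_features = ["age", "fare"]
categorical_features = ["embarked", "sex"]

# Numeric columns: fill missing values, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: fill missing values, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Each transformer only sees the columns listed next to it.
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features),
])

# Tiny hand-made frame (not the real dataset) to check the wiring.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, 8.05, 26.55],
    "embarked": ["S", "C", "S", "Q"],
    "sex": ["male", "female", "female", "male"],
})
toy_out = preprocessor.fit_transform(toy)
```

The first two output columns are the standardized numeric features; the remaining ones are the one-hot indicators, which the scaler never touches.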

Classification

# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)

clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))

Old answer:

Suppose you have the following DataFrame:

In [163]: df
Out[163]:
     a     b    c    d
0  aaa  1.01  xxx  111
1  bbb  2.02  yyy  222
2  ccc  3.03  zzz  333

In [164]: df.dtypes
Out[164]:
a     object
b    float64
c     object
d      int64
dtype: object

You can find all the numeric columns:

In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]

In [166]: num_cols
Out[166]: Index(['b', 'd'], dtype='object')

In [167]: df[num_cols]
Out[167]:
      b    d
0  1.01  111
1  2.02  222
2  3.03  333
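As an alternative to the np.issubdtype lambda, pandas has a built-in selector that also copes with pandas-specific dtypes such as category. This is a swapped-in technique rather than what the answer used; the frame below just rebuilds the example DataFrame shown above.

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the answer.
df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# select_dtypes keeps only the columns whose dtype is numeric.
num_cols = df.select_dtypes(include=np.number).columns
print(list(num_cols))  # ['b', 'd']
```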

and apply StandardScaler only to those numeric columns:

In [168]: scaler = StandardScaler()

In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])

In [170]: df
Out[170]:
     a         b    c         d
0  aaa -1.224745  xxx -1.224745
1  bbb  0.000000  yyy  0.000000
2  ccc  1.224745  zzz  1.224745

Now you can one-hot encode the categorical (non-numeric) columns...
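A minimal sketch of that last step, assuming pd.get_dummies is used for the encoding (as in the question) and reusing the example frame from above:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "a": ["aaa", "bbb", "ccc"],
    "b": [1.01, 2.02, 3.03],
    "c": ["xxx", "yyy", "zzz"],
    "d": [111, 222, 333],
})

# Scale the numeric columns first ...
num_cols = ["b", "d"]
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# ... then expand the remaining (non-numeric) columns into dummies.
# The scaled columns are carried over unchanged.
encoded = pd.get_dummies(df, columns=["a", "c"])
print(encoded.columns.tolist())
```

Note that this fits the scaler on the whole frame, so it suits exploration more than a train/test workflow, where the scaler should be fitted on the training split only.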

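I would use FeatureUnion. I then usually do something like the following, assuming you also dummy-encode your categorical variables inside the pipeline rather than beforehand with Pandas: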

from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier

class Columns(BaseEstimator, TransformerMixin):
    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]

numeric = [...]      # list of numeric column names
categorical = [...]  # list of categorical column names

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=numeric),StandardScaler())),
        # dense output; the argument is spelled `sparse=False` on scikit-learn < 1.2
        ('categorical', make_pipeline(Columns(names=categorical), OneHotEncoder(sparse_output=False)))
    ])),
    ('model', KNeighborsClassifier())
])

You can also take a further look at [...], which is interesting as well.

Since you have already converted your categorical features into dummies using pd.get_dummies, you don't need OneHotEncoder. As a result, your pipeline should be:

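from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.neighbors import KNeighborsClassifier

# cat_indices / num_indices: positional indices of the dummy and numeric columns.
# The lambdas below slice with data[:, idx], so the input must be a NumPy array.
knn = KNeighborsClassifier()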

pipeline=Pipeline(steps= [
    ('feature_processing', FeatureUnion(transformer_list = [
            ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),

            #numeric
            ('numeric', Pipeline(steps = [
                ('select', FunctionTransformer(lambda data: data[:, num_indices])),
                ('scale', StandardScaler())
                        ]))
        ])),
    ('clf', knn)
    ]
)
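The pipeline above relies on cat_indices and num_indices, which the answer does not define. One possible way to derive them from a frame that has already been through pd.get_dummies is sketched here; the frame raw and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: two numeric columns plus one categorical column.
raw = pd.DataFrame({
    "height": [1.6, 1.8, 1.7],
    "weight": [60.0, 80.0, 70.0],
    "color": ["red", "blue", "red"],
})
X_df = pd.get_dummies(raw, columns=["color"])

numeric = ["height", "weight"]
# Positions of the numeric columns, and of everything else (the dummies).
num_indices = [X_df.columns.get_loc(name) for name in numeric]
cat_indices = [i for i in range(X_df.shape[1]) if i not in num_indices]

# The lambdas index with data[:, idx], so pass a NumPy array, not a DataFrame.
X_arr = X_df.to_numpy(dtype=float)
```

With these in place, X_arr can be handed to pipeline.fit together with the target.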

Another approach:

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df = pd.DataFrame()
df['col1'] = np.random.randint(1, 20, 10)
df['col2'] = np.random.randn(10)
df['col3'] = list(5*'Y' + 5*'N')
numeric_cols = list(df.dtypes[df.dtypes != 'object'].index)
# assigning through df[...] replaces the columns, so the integer column can become float
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])

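Possible duplicate of [...]

Hey Marcus, thanks for your post. So how would you use this "pipe" on training and test data? pipe.fit(X_train, y_train)? But in that case the fit_transform step of the encoder would be skipped. And if I use fit_transform, then the model-fitting part would be skipped.

You can use it like any other estimator and first call pipe.fit(X_train, y_train). That calls fit_transform() on every TransformerMixin step and then fit() on the last step, which is the estimator. If you then use it for prediction, it applies all the transformations as well. This also works automatically in the [...] class.

Thanks Marcus, this is helpful.

@NaveenKumar I did not fully understand your question. Currently, the output of Columns(names=numeric) is passed to StandardScaler(), then concatenated with the output of OneHotEncoder(), and the result is passed to KNeighborsClassifier().

@NaveenKumar Take a look at ColumnTransformer, for example. For ColumnTransformer, note the remainder parameter, which controls what happens to the columns you did not name (drop them or pass them through).

This solution does not involve a pipeline.

@Nocibanbi, you are right, thanks for your comment! I have updated my answer, so it now shows how to use the relatively new ColumnTransformer class ;) inside a Pipeline.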
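Two points from the comments above, calling pipe.fit on the training split and the remainder option of ColumnTransformer, can be put together into a short sketch for the case in the question, where the dummies already exist and should be left alone. The data is synthetic and the column names (age, income, city) are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, invented for illustration only.
rng = np.random.RandomState(0)
raw = pd.DataFrame({
    "age": rng.randint(18, 70, size=40).astype(float),
    "income": rng.normal(50_000, 10_000, size=40),
    "city": rng.choice(["A", "B", "C"], size=40),
})
y = (raw["age"] > 40).astype(int)

# Dummy-encode up front, as in the question.
X = pd.get_dummies(raw, columns=["city"], dtype=float)
numeric = ["age", "income"]

# Scale only the named numeric columns; remainder="passthrough" forwards
# every other column (the dummies) unchanged.
scale_numeric = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric)],
    remainder="passthrough",
)
pipe = make_pipeline(scale_numeric, KNeighborsClassifier(n_neighbors=3))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
pipe.fit(X_train, y_train)          # fits the scaler on the training split only
predictions = pipe.predict(X_test)  # reuses the fitted scaler, then predicts
```

Because the scaler lives inside the pipeline, its mean and variance come from the training rows alone, and the same pipe object can be passed to cross-validation utilities without leaking test data into the scaling step.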