Python: How to standardize only the numeric variables in an sklearn pipeline?
I am trying to create an sklearn pipeline with two steps: standardizing the data, then fitting the data using KNN.
However, my data has both numeric and categorical variables, and I have already converted the categorical ones to dummy variables using pd.get_dummies. I want to standardize the numeric variables but leave the dummies as they are. I have been doing it like this:
# X: DataFrame containing both numeric and categorical (dummy) columns
# numeric: list of numeric column names
# categorical: list of categorical column names
scaler = StandardScaler()
# Keep X's index so that the index-based merge below lines the rows up correctly
X_numeric_std = pd.DataFrame(data=scaler.fit_transform(X[numeric]), columns=numeric, index=X.index)
X_std = pd.merge(X_numeric_std, X[categorical], left_index=True, right_index=True)
However, if I were to create a pipeline such as:
pipe = sklearn.pipeline.make_pipeline(StandardScaler(), KNeighborsClassifier())
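it would standardize all of the columns in my DataFrame. Is there a way to do this while standardizing only the numeric columns?

UPD: 2021-05-10
For sklearn >= 0.20 we can use ColumnTransformer.
Here is an example:
Imports and data loading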
# Author: Pedro Morales <part.morales@gmail.com>
#
# License: BSD 3 clause
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV
np.random.seed(0)
# Load data from https://www.openml.org/d/40945
X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)
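The classification step that follows uses a `preprocessor` object whose definition does not appear in this text. Below is a minimal sketch of what such a preprocessor might look like. It uses a small made-up DataFrame instead of the Titanic data so that it runs offline; the column names (`age`, `fare`, `sex`, `embarked`), the values, and the choice of imputation strategy are assumptions for illustration, not necessarily the original answer's code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real dataset.
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.92, 53.10],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "S"],
})

numeric_features = ["age", "fare"]
categorical_features = ["sex", "embarked"]

# Numeric columns: fill missing values with the median, then standardize.
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# Categorical columns: one-hot encode, ignoring categories unseen during fit.
categorical_transformer = OneHotEncoder(handle_unknown="ignore")

# Each transformer is applied only to the columns listed for it.
# sparse_threshold=0 asks for a dense NumPy array as output.
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    sparse_threshold=0,
)

transformed = preprocessor.fit_transform(toy)
print(transformed.shape)
```

The first two output columns are the standardized numeric features and the remaining ones are the one-hot columns, so the scaler never touches the categorical data.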
Classification
# Append classifier to preprocessing pipeline.
# Now we have a full prediction pipeline.
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('classifier', LogisticRegression())])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0)
clf.fit(X_train, y_train)
print("model score: %.3f" % clf.score(X_test, y_test))
Old answer: Suppose you have the following DataFrame:
In [163]: df
Out[163]:
     a     b    c    d
0  aaa  1.01  xxx  111
1  bbb  2.02  yyy  222
2  ccc  3.03  zzz  333
In [164]: df.dtypes
Out[164]:
a     object
b    float64
c     object
d      int64
dtype: object
You can find all of the numeric columns:
In [165]: num_cols = df.columns[df.dtypes.apply(lambda c: np.issubdtype(c, np.number))]
In [166]: num_cols
Out[166]: Index(['b', 'd'], dtype='object')
In [167]: df[num_cols]
Out[167]:
b d
0 1.01 111
1 2.02 222
2 3.03 333
and apply StandardScaler only to those numeric columns:
In [168]: scaler = StandardScaler()
In [169]: df[num_cols] = scaler.fit_transform(df[num_cols])
In [170]: df
Out[170]:
a b c d
0 aaa -1.224745 xxx -1.224745
1 bbb 0.000000 yyy 0.000000
2 ccc 1.224745 zzz 1.224745
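Now you can one-hot encode the categorical (non-numeric) columns...

I would use a FeatureUnion. I usually do something like this, assuming you also dummy-encode your categorical variables inside the pipeline rather than beforehand with Pandas: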
from sklearn.pipeline import Pipeline, FeatureUnion, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.neighbors import KNeighborsClassifier

class Columns(BaseEstimator, TransformerMixin):
    """Select the given columns from a DataFrame."""
    def __init__(self, names=None):
        self.names = names
    def fit(self, X, y=None, **fit_params):
        return self
    def transform(self, X):
        return X[self.names]

numeric = [...]      # list of numeric column names
categorical = [...]  # list of categorical column names

pipe = Pipeline([
    ("features", FeatureUnion([
        ('numeric', make_pipeline(Columns(names=numeric), StandardScaler())),
        # On scikit-learn >= 1.2 the keyword is sparse_output=False instead of sparse=False
        ('categorical', make_pipeline(Columns(names=categorical), OneHotEncoder(sparse=False)))
    ])),
    ('model', KNeighborsClassifier())
])
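A sketch of how a pipeline like the one above might be fit on training data and then applied to test data. The DataFrame, its column names, the target, and `n_neighbors=3` are all made up for illustration. OneHotEncoder is used with its default output here so the example does not depend on the `sparse` / `sparse_output` keyword, and the mixin is listed before BaseEstimator, which differs from the class above only in base-class order.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


class Columns(TransformerMixin, BaseEstimator):
    """Select a subset of DataFrame columns by name."""

    def __init__(self, names=None):
        self.names = names

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X):
        return X[self.names]


# Hypothetical data: two numeric columns and one categorical column.
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "height": rng.normal(170, 10, 40),
    "weight": rng.normal(70, 15, 40),
    "color": rng.choice(["red", "green", "blue"], 40),
})
y = (X["height"] > 170).astype(int)

numeric = ["height", "weight"]
categorical = ["color"]

pipe = Pipeline([
    ("features", FeatureUnion([
        ("numeric", make_pipeline(Columns(names=numeric), StandardScaler())),
        ("categorical", make_pipeline(Columns(names=categorical),
                                      OneHotEncoder(handle_unknown="ignore"))),
    ])),
    ("model", KNeighborsClassifier(n_neighbors=3)),
])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# fit() learns the scaler statistics and encoder categories from the training
# data only, then fits the classifier on the transformed training features.
pipe.fit(X_train, y_train)

# predict() and score() re-apply the already fitted transformers to the new
# data before calling the classifier.
predictions = pipe.predict(X_test)
test_score = pipe.score(X_test, y_test)
print(test_score)
```

Because the scaler and encoder live inside the pipeline, the test data is transformed with statistics learned from the training split, so no separate fit_transform call is needed.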
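It is also worth looking further into related tooling, which is interesting as well.

Since you have already converted your categorical features into dummies using pd.get_dummies, you don't need to use OneHotEncoder. As a result, your pipeline should be: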
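from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.pipeline import Pipeline, FeatureUnion

# cat_indices / num_indices: lists of integer column positions of the dummy and numeric
# columns; the lambdas index with data[:, ...], so X must be a NumPy array, not a DataFrame
knn = KNeighborsClassifier()
pipeline = Pipeline(steps=[
    ('feature_processing', FeatureUnion(transformer_list=[
        # categorical (dummy) columns are passed through unchanged
        ('categorical', FunctionTransformer(lambda data: data[:, cat_indices])),
        # numeric columns are selected and then standardized
        ('numeric', Pipeline(steps=[
            ('select', FunctionTransformer(lambda data: data[:, num_indices])),
            ('scale', StandardScaler())
        ]))
    ])),
    ('clf', knn)
])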
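The lambdas above index with `data[:, indices]`, so they expect a NumPy array rather than a DataFrame. Here is a small sketch under that assumption. The values of `num_indices` and `cat_indices`, the toy data, and `n_neighbors=3` are made up, and named functions replace the lambdas only so that the pipeline stays picklable.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Hypothetical layout: columns 0 and 1 are numeric, columns 2 and 3 are 0/1 dummies.
num_indices = [0, 1]
cat_indices = [2, 3]


def select_categorical(data):
    return data[:, cat_indices]


def select_numeric(data):
    return data[:, num_indices]


rng = np.random.RandomState(0)
X = np.column_stack([
    rng.normal(50, 5, 20),
    rng.normal(0, 100, 20),
    rng.randint(0, 2, 20),
    rng.randint(0, 2, 20),
])
y = rng.randint(0, 2, 20)

pipeline = Pipeline(steps=[
    ("feature_processing", FeatureUnion(transformer_list=[
        # Dummy columns are passed through unchanged.
        ("categorical", FunctionTransformer(select_categorical)),
        # Numeric columns are selected and then standardized.
        ("numeric", Pipeline(steps=[
            ("select", FunctionTransformer(select_numeric)),
            ("scale", StandardScaler()),
        ])),
    ])),
    ("clf", KNeighborsClassifier(n_neighbors=3)),
])

pipeline.fit(X, y)

# Inspect the features the classifier actually sees: dummies first, scaled numerics last.
features = pipeline.named_steps["feature_processing"].transform(X)
print(features[:3])
```

Note that FeatureUnion concatenates outputs in the order the transformers are listed, so the column order of the result differs from the input.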
Another way of doing this would be:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df = pd.DataFrame()
df['col1'] = np.random.randint(1,20,10)
df['col2'] = np.random.randn(10)
df['col3'] = list(5*'Y' + 5*'N')
# Pick the numeric columns by dtype
numeric_cols = df.select_dtypes(include='number').columns.tolist()
# Plain column assignment replaces the columns, so the integer column becomes float
df[numeric_cols] = scaler.fit_transform(df[numeric_cols])
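Comments:
Hey Marcus, thanks for your post. So how would you use this pipe on training and test data? pipe.fit(X_train, y_train)? But in that case the fit_transform step of the encoders would be ignored. And if I use fit_transform, the model fitting part would be ignored.
You can use it like any estimator and first call pipe.fit(X_train, y_train). This calls fit_transform() on every TransformerMixin step and then fit() on the last step, which is the estimator. If you then use it for prediction, it also applies all of the transforms. This also works automatically when the pipeline is used inside a wrapping class.
Thanks Marcus, that was helpful.
@NaveenKumar I did not quite get your question. Currently, the output of Columns(names=numeric) is passed to StandardScaler(), then concatenated with the output of OneHotEncoder() and passed to KNeighborsClassifier().
@NaveenKumar Check out ColumnTransformer, for example. For ColumnTransformer, note the remainder parameter, with which you can control what happens to the columns you did not name (drop them or pass them through).
This solution does not involve a pipeline.
@Nocibanbi, you are right, thank you for your comment! I have updated my answer, so it now shows how to use the relatively new ColumnTransformer class ;) together with a Pipeline.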
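For the situation in the original question, where the dummies already exist, the remainder option mentioned in the comments can be sketched roughly as follows. The DataFrame, its column names, the labels, and `n_neighbors=3` are invented for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical raw data with two numeric columns and one categorical column.
raw = pd.DataFrame({
    "age": [25, 32, 47, 51, 38, 29],
    "income": [30000.0, 52000.0, 81000.0, 95000.0, 60000.0, 41000.0],
    "city": ["A", "B", "A", "C", "B", "C"],
})
y = [0, 0, 1, 1, 1, 0]

# Dummy-encode beforehand, as in the question.
X = pd.get_dummies(raw, columns=["city"])
numeric = ["age", "income"]

# Scale only the numeric columns; remainder="passthrough" keeps every other
# column (here the dummies) unchanged instead of dropping it.
scale_numeric = ColumnTransformer(
    transformers=[("num", StandardScaler(), numeric)],
    remainder="passthrough",
)

pipe = make_pipeline(scale_numeric, KNeighborsClassifier(n_neighbors=3))
pipe.fit(X, y)

# make_pipeline names each step after its lowercased class name.
transformed = pipe.named_steps["columntransformer"].transform(X)
predictions = pipe.predict(X)
print(transformed.shape)
```

With the default `remainder="drop"` the dummy columns would be discarded, so passing them through explicitly is what keeps them available to the classifier.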