Python: How to output Pandas objects from an sklearn pipeline


I have built a pipeline that takes in a dataframe that has been split into categorical and numeric columns. I am trying to run GridSearchCV on it and, ultimately, look at the importance-ranked features of the best-performing model that GridSearchCV selects. The problem I am running into is that sklearn pipelines output numpy array objects and lose any column information along the way. As a result, when I go to examine the most important coefficients of the model, I am left with an unlabeled numpy array.

I have read that building a custom transformer might be a possible way around this, but I have no experience doing so myself. I have also looked into leveraging the sklearn-pandas package, but I am hesitant to adopt something that might not be updated in parallel with sklearn. Can anyone suggest what they believe is the best path for getting around this issue? I am also open to any literature on hands-on application of pandas with sklearn.
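For context, the custom-transformer route mentioned above usually looks something like the minimal sketch below. The class PandasStandardScaler and its details are my own illustration, not an established recipe: it does the same work as StandardScaler but hands back a labeled DataFrame.

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler

class PandasStandardScaler(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper: scales like StandardScaler but keeps a DataFrame."""
    def fit(self, X, y=None):
        self.columns_ = list(X.columns)        # remember the incoming column names
        self.scaler_ = StandardScaler().fit(X)
        return self

    def transform(self, X):
        scaled = self.scaler_.transform(X)
        # hand back a DataFrame so column labels survive the pipeline step
        return pd.DataFrame(scaled, columns=self.columns_, index=X.index)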

My pipeline:

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

# numeric_features / categorical_features are my lists of column names

# impute and standardize numeric data 
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))  # 'sparse' was renamed 'sparse_output' in sklearn 1.2
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])
Cross-validation:

kf = KFold(n_splits=4, shuffle=True, random_state=44)

cross_val_score(clf, X_train, y_train, cv=kf).mean()
Grid search:

param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}

gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)
Examining the coefficients:

model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
model.named_steps['ridge'].coef_
Here is the output of the model coefficients when run on the seaborn "mpg" dataset:

(output not reproduced: an unlabeled numpy array of coefficients)
Ideally, I would like to retain the pandas dataframe information and retrieve the derived column names after calling OneHotEncoder and the other methods.
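As a hedged aside for readers on newer versions: scikit-learn 1.2 added a set_output API aimed at exactly this. A minimal sketch, assuming scikit-learn >= 1.2 and the clf pipeline defined above:

# minimal sketch, assuming scikit-learn >= 1.2, where the set_output API exists;
# note that on those versions OneHotEncoder's `sparse` argument is named `sparse_output`
clf.set_output(transform="pandas")   # every transformer now emits DataFrames
clf.fit(X_train, y_train)
# derived column names (one-hot columns included) from the fitted preprocessor
print(clf.named_steps['transform'].get_feature_names_out())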

I would actually create the column names from the input. If your input is already divided into numeric and categorical columns, you can use pd.get_dummies to get the distinct categories of each categorical feature.

You can then create proper names for the columns, as shown in the last part of the working example below, which uses the question's setup and some artificial data.

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

# create artificial data
numeric_features_vals = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0.15, 0.25, 0.5, 0.45]})
numeric_features = ['x1', 'x2']
categorical_features_vals = pd.DataFrame({'cat1': [0, 1, 1, 2], 'cat2': [2, 1, 5, 0] })
categorical_features = ['cat1', 'cat2']

X_train = pd.concat([numeric_features_vals, categorical_features_vals], axis=1)
X_test = pd.DataFrame({'x1':[2,3], 'x2':[0.2, 0.3], 'cat1':[0, 1], 'cat2':[2, 1]})
y_train = pd.DataFrame({'labels': [10, 20, 30, 40]})

# impute and standardize numeric data 
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])


kf = KFold(n_splits=2, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()

param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}

gs = GridSearchCV(clf, param_grid, cv = kf)
gs.fit(X_train, y_train)

model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
print('coefficients : ',  model.named_steps['ridge'].coef_, '\n')

# create column names for categorical hot encoded data
columns_names_to_map = list(numeric_features)
columns_names_to_map.extend('cat1_' + str(col) for col in pd.get_dummies(X_train['cat1']).columns)
columns_names_to_map.extend('cat2_' + str(col) for col in pd.get_dummies(X_train['cat2']).columns)

print('columns after preprocessing:', columns_names_to_map, '\n')
print('#' * 80)
print('\n', 'dataframe of rescaled features with custom column names: \n\n',
      pd.DataFrame({col: vals for vals, col in zip(preprocessor.fit_transform(X_train).T, columns_names_to_map)}))
print('#' * 80)
print('\n', 'dataframe of ridge coefficients with custom column names: \n\n',
      pd.DataFrame({col: vals for vals, col in zip(model.named_steps['ridge'].coef_.T, columns_names_to_map)}))
The code above prints out (at the end) the following dataframe, which maps parameter names to parameter values:

(output not reproduced: the dataframe of ridge coefficients keyed by the custom column names)

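To go from that mapping to the importance-ranked features the question asked about, one further hedged step (reusing columns_names_to_map and model from the example above) is to sort by absolute coefficient size:

# rank features by the magnitude of their (standardized) ridge coefficients
coef_df = pd.DataFrame({
    'feature': columns_names_to_map,
    'coefficient': model.named_steps['ridge'].coef_.ravel()  # flatten (1, n) to (n,)
})
coef_df = coef_df.reindex(
    coef_df['coefficient'].abs().sort_values(ascending=False).index)
print(coef_df)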
This might help: could you pass your input through the pipeline's 'transform' layer and get from it the column names that correspond to the input of the 'ridge' layer (X_train_transf)? When using the Ridge class in sklearn, the .coef_ array stores the coefficients of the fitted model and preserves their order, so if you know the column names you can map them onto the 'unlabeled' array:

param_coef_df = pd.DataFrame({'feature': X_train_transf.columns, 'coefficient': model.named_steps['ridge'].coef_})
param_coef_df = param_coef_df.sort_values(by='coefficient')
@JacoSolari would you mind turning that comment into an answer that shows a working example?

@JacoSolari it has been a while since I revisited this question, but in my own work I implemented that same logic, combining the transformed coefficient names into a dataframe. I believe there is still the limitation that, when working with pipelines, every individual transform has to be referenced as a named step. It would be nice if the ColumnTransformer.get_feature_names method supported pipelines, but so far it does not.

@lurscher I added an answer, let me know if it fits your needs. But how would you retrieve any columns that SimpleImputer.fit_transform may have dropped? As far as I know, SimpleImputer.fit only drops columns that consist entirely of missing values (at least when strategy='constant'). In any case, if a column were dropped before the Ridge block for whatever reason, my code would throw an error, because the length of the Ridge model's coefficients and of the column names in columns_names_to_map would no longer match. Alternatively, you can bypass the imputer and simply replace the missing values with df.fillna(a_value_you_compute) (missing values in a dataframe are usually np.nan).