Python: how to output Pandas objects from an sklearn pipeline
I have built a pipeline that takes a DataFrame that has been split into categorical and numerical columns. I am trying to run GridSearchCV on my results and ultimately look at the ranked feature importances for the best-performing model that GridSearchCV selects. The problem I am running into is that the sklearn pipeline outputs numpy array objects and loses any column information along the way. So when I go to examine the most important coefficients of the model, I am left with an unlabeled numpy array.

I have read that building a custom transformer might be one way around this, but I have no experience doing so myself. I have also looked into leveraging the sklearn-pandas package, but I am hesitant to adopt something that might not be updated in parallel with sklearn. Can anyone suggest what they believe is the best path around this issue? I am also open to any literature with hands-on application of pandas and sklearn.

My pipeline:
# impute and standardize numeric data
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])
Cross-validation:
kf = KFold(n_splits=4, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()
Grid search:
param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}
gs = GridSearchCV(clf, param_grid, cv=kf)
gs.fit(X_train, y_train)
Inspecting the coefficients:
model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
model.named_steps['ridge'].coef_
Here is the output of the model coefficients when run on the seaborn "mpg" dataset:
Ideally, I would like to retain the pandas DataFrame information and retrieve the derived column names after calling OneHotEncoder and the other methods.

Answer: I would actually create the column names from the input. If your input is already divided into numerical and categorical parts, you can use pd.get_dummies to get the number of distinct categories for each categorical feature. Then you can create proper names for the columns, as shown in the last part of this working example, based on your question plus some artificial data.
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score, GridSearchCV

# create artificial data
numeric_features_vals = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0.15, 0.25, 0.5, 0.45]})
numeric_features = ['x1', 'x2']
categorical_features_vals = pd.DataFrame({'cat1': [0, 1, 1, 2], 'cat2': [2, 1, 5, 0]})
categorical_features = ['cat1', 'cat2']

X_train = pd.concat([numeric_features_vals, categorical_features_vals], axis=1)
X_test = pd.DataFrame({'x1': [2, 3], 'x2': [0.2, 0.3], 'cat1': [0, 1], 'cat2': [2, 1]})
y_train = pd.DataFrame({'labels': [10, 20, 30, 40]})

# impute and standardize numeric data
numeric_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="mean")),
    ('scale', StandardScaler())
])

# impute and encode dummy variables for categorical data
categorical_transformer = Pipeline([
    ('impute', SimpleImputer(missing_values=np.nan, strategy="most_frequent")),
    ('one_hot', OneHotEncoder(sparse=False, handle_unknown='ignore'))
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_features),
    ('cat', categorical_transformer, categorical_features)
])

clf = Pipeline([
    ('transform', preprocessor),
    ('ridge', Ridge())
])

kf = KFold(n_splits=2, shuffle=True, random_state=44)
cross_val_score(clf, X_train, y_train, cv=kf).mean()

param_grid = {
    'ridge__alpha': [.001, .1, 1.0, 5, 10, 100]
}
gs = GridSearchCV(clf, param_grid, cv=kf)
gs.fit(X_train, y_train)

model = gs.best_estimator_
predictions = model.fit(X_train, y_train).predict(X_test)
print('coefficients : ', model.named_steps['ridge'].coef_, '\n')

# create column names for the one-hot encoded categorical data
columns_names_to_map = list(np.copy(numeric_features))
columns_names_to_map.extend('cat1_' + str(col) for col in pd.get_dummies(X_train['cat1']).columns)
columns_names_to_map.extend('cat2_' + str(col) for col in pd.get_dummies(X_train['cat2']).columns)
print('columns after preprocessing :', columns_names_to_map, '\n')
print('#' * 80)
print('\n', 'dataframe of rescaled features with custom column names: \n\n',
      pd.DataFrame({col: vals for vals, col in zip(preprocessor.fit_transform(X_train).T, columns_names_to_map)}))
print('#' * 80)
print('\n', 'dataframe of ridge coefficients with custom column names: \n\n',
      pd.DataFrame({col: vals for vals, col in zip(model.named_steps['ridge'].coef_.T, columns_names_to_map)}))
The code above prints (at the end) the following dataframes, which map parameter names to parameter values:
This might help: can you pass your input through the pipeline's 'transform' step and get from it the column names corresponding to the input of the 'ridge' step (X_train_transf)? When using the Ridge class in sklearn, the .coef_ array stores the coefficients of the fitted model and preserves their order, so if you know the column names you can map them onto the "unlabeled" array: param_coef_df = pd.DataFrame({'feature': X_train_transf.columns, 'coefficient': model.named_steps['ridge'].coef_}), then param_coef_df = param_coef_df.sort_values(by='coefficient').

@JacoSolari Would you mind converting that comment into an answer showing a working example?

@JacoSolari It has been a while since I revisited this, but in my own work I implemented the same logic in code, combining the transformed coefficient names into a dataframe. I believe there are still limitations when working with pipelines, in that each individual transformation must be referred to as a named step. It would be great if the ColumnTransformer.get_feature_names method supported pipelines, but so far it does not.

@lurscher I added an answer; let me know whether it fits your needs. But how do you retrieve any columns that SimpleImputer.fit_transform may have dropped? As far as I know, SimpleImputer.fit only drops columns that contain nothing but missing values (at least when strategy='constant'). In any case, if a column is dropped before the Ridge block for any reason, my code will raise an error, because the number of Ridge model coefficients and the number of column names in columns_names_to_map will differ. Alternatively, you can bypass the imputer and simply use df.fillna(a_value_you_compute) to replace the missing values (missing values in a dataframe are usually np.nan).
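On the broader question in the title (getting pandas objects out of an sklearn pipeline): scikit-learn 1.2 added a set_output API, so a transformer or pipeline can be asked to return labeled DataFrames directly, without a custom transformer or the sklearn-pandas package. A minimal sketch under that version assumption, reusing the artificial data from the answer:

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge

X_train = pd.DataFrame({'x1': [1, 2, 3, 4], 'x2': [0.15, 0.25, 0.5, 0.45],
                        'cat1': [0, 1, 1, 2], 'cat2': [2, 1, 5, 0]})
y_train = [10, 20, 30, 40]

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('impute', SimpleImputer(strategy='mean')),
                      ('scale', StandardScaler())]), ['x1', 'x2']),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      # dense output is required when asking for pandas output
                      ('one_hot', OneHotEncoder(sparse_output=False,
                                                handle_unknown='ignore'))]),
     ['cat1', 'cat2']),
])
# every transform now returns a labeled DataFrame instead of an ndarray
preprocessor.set_output(transform='pandas')

clf = Pipeline([('transform', preprocessor), ('ridge', Ridge())])
clf.fit(X_train, y_train)

# the ridge step saw a DataFrame, so the derived names can be mapped
# straight onto the coefficient array
feature_names = clf.named_steps['transform'].get_feature_names_out()
coef_df = pd.Series(clf.named_steps['ridge'].coef_, index=feature_names)
print(coef_df.sort_values())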