Python: How do I correctly use SelectFromModel from scikit-learn for feature selection?


python machine-learning scikit-learn


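I am trying to understand how SelectFromModel and logistic regression work together. The idea is to build a very simple pipeline that does some basic data processing (dropping columns + scaling), passes the result to feature selection (logreg), and then fits an xgboost model (not included in the code). From what I have read, my understanding is that, given my X_train and y_train, a logreg model is fitted and the features whose coefficients are greater than or equal to the threshold are selected. In my case I set the threshold to 1.25*mean.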
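For readers without the original code, the pipeline described above might look roughly like the sketch below. Everything in it is an assumption made for illustration: the data comes from make_classification, the dropped columns are arbitrary, and scikit-learn's GradientBoostingClassifier stands in for the final model because that is the step name used in the grid-search parameters later in the question.

```python
from sklearn.compose import ColumnTransformer
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption: the real data is not shown).
X_train, y_train = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

# Hypothetical preprocessing: keep and scale the first 8 columns, drop the rest.
keep_cols = list(range(8))
preprocess = ColumnTransformer(
    [("scale", StandardScaler(), keep_cols)], remainder="drop"
)

# make_pipeline names each step after its lowercased class name,
# which is where 'selectfrommodel' and 'gradientboostingclassifier' come from.
pipe = make_pipeline(
    preprocess,
    SelectFromModel(estimator=LogisticRegression(), threshold="1.25*mean"),
    GradientBoostingClassifier(random_state=0),
)
pipe.fit(X_train, y_train)

selector = pipe.named_steps["selectfrommodel"]
print(list(pipe.named_steps))   # step names usable in GridSearchCV keys
print(selector.threshold_)      # numeric cutoff computed during fit
print(selector.get_support())   # boolean mask of the features that were kept
```

The step names printed here are the prefixes that nested parameter keys such as selectfrommodel__... have to start with.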

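I cannot understand why the output of

selector.threshold_

differs from

selector.estimator_.coef_.mean()*1.25

I expected the two values to be the same. Why is that not the case?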

Next, I would like to run GridSearchCV to fine-tune my pipeline parameters. I usually do it like this:
from sklearn.model_selection import GridSearchCV

params = {}
params['gradientboostingclassifier__learning_rate'] = [0.05, 0.1, 0.2]
params['selectfrommodel__estimator__C'] = [0.1, 1, 10]
params['selectfrommodel__estimator__penalty']= ['l1', 'l2']
params['selectfrommodel__estimator__threshold']=['median', 'mean', '1.25*mean', '0.75*mean']

grid = GridSearchCV(pipe, params, cv=5, scoring='recall')
%time grid.fit(X_train, y_train);

Unfortunately, threshold does not seem to be in the parameter list (pipe.named_steps.selectfrommodel.estimator.get_params().keys()), so this line of the GridSearchCV setup has to be commented out:
params['selectfrommodel__estimator__threshold']=['median', 'mean', '1.25*mean', '0.75*mean']

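Is there a way to fine-tune the threshold?

Because the importance is based on the mean of the absolute values of the coefficients. If you average the signed values instead, positive and negative coefficients partly cancel and the mean importance comes out lower.
I have put together an example to demonstrate this behavior: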
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
X = [[ 0.87, -1.34,  0.31 ],
     [-2.79, -0.02, -0.85 ],
     [-1.34, -0.48, -2.55 ],
     [ 1.92,  1.48,  0.65 ]]
y = [0, 1, 0, 1]
selector = SelectFromModel(estimator=LogisticRegression(), threshold="1.25*mean").fit(X, y)
print(selector.estimator_.coef_)
print(selector.threshold_) # 0.6905659148858644
# note here the absolute transformation before the mean
print(abs(selector.estimator_.coef_).mean()*1.25) # 0.6905659148858644

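Also note that the feature importances are a result of model training, not something you can define in advance. That is why you cannot get at the threshold beforehand: it is only available after the model has been trained.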
@Nikaido's answer to the first part of the question is exactly right: abs() was missing. This means that abs(selector.estimator_.coef_).mean()*1.25 is equal to selector.threshold_.
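As a quick sanity check, the same comparison can be repeated for the other threshold strings used in the question. This is only a sketch on the toy data from the answer above, and it assumes a binary LogisticRegression, where coef_ has a single row so the importance of a feature is simply the absolute value of its coefficient.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Toy data copied from the example in the first answer.
X = [[ 0.87, -1.34,  0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [ 1.92,  1.48,  0.65]]
y = [0, 1, 0, 1]

# (threshold string, scale factor, reference statistic)
rules = [
    ("mean", 1.0, np.mean),
    ("median", 1.0, np.median),
    ("1.25*mean", 1.25, np.mean),
    ("0.75*mean", 0.75, np.mean),
]

checks = {}
for rule, scale, stat in rules:
    selector = SelectFromModel(LogisticRegression(), threshold=rule).fit(X, y)
    # Importances are taken from the absolute coefficients, not the signed ones.
    importances = np.abs(selector.estimator_.coef_).ravel()
    expected = scale * stat(importances)
    checks[rule] = bool(np.isclose(selector.threshold_, expected))
    print(rule, selector.threshold_, expected, selector.get_support())
```

get_support() returns the mask of features whose importance is at least threshold_, so it is an easy way to see what each rule actually keeps.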

For the second part, this is indeed possible. The correct approach is to change this line:
params['selectfrommodel__estimator__threshold']=['median', 'mean', '1.25*mean', '0.75*mean']

to this line:
params['selectfrommodel__threshold']=['median', 'mean', '1.25*mean', '0.75*mean']

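This works because threshold is a parameter of selectfrommodel, not of estimator. See below how to get the full list of parameters in each case, which you can then use for further hyperparameter tuning: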
pipe.named_steps.selectfrommodel.get_params().keys() 
pipe.named_steps.selectfrommodel.estimator.get_params().keys() 
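Putting the corrected key to work, a complete grid search could look something like the sketch below. The data is synthetic, the pipeline is a simplified stand-in for the one in the question (scaling only, no column dropping), and the grid is kept small so that it runs quickly. The penalty entry from the question is left out on purpose, because the default LogisticRegression solver does not support 'l1'.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X_train / y_train (assumption).
X_train, y_train = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)

pipe = make_pipeline(
    StandardScaler(),
    SelectFromModel(estimator=LogisticRegression()),
    GradientBoostingClassifier(n_estimators=20, random_state=0),
)

selector_step = pipe.named_steps["selectfrommodel"]
# threshold belongs to SelectFromModel itself ...
print("threshold" in selector_step.get_params())            # True
# ... and not to the wrapped LogisticRegression.
print("threshold" in selector_step.estimator.get_params())  # False

params = {
    # One level of nesting: a parameter of the SelectFromModel step.
    "selectfrommodel__threshold": ["median", "mean", "1.25*mean", "0.75*mean"],
    # Two levels of nesting: a parameter of the estimator inside that step.
    "selectfrommodel__estimator__C": [0.1, 1, 10],
}

grid = GridSearchCV(pipe, params, cv=3, scoring="recall")
grid.fit(X_train, y_train)
print(grid.best_params_)
```

The rule of thumb is that every double underscore goes one level deeper: step name, then the parameter of that step, then (for estimator) a parameter of the nested model.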

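The first part is clear now! What do you mean by "a result of model training"? You are referring to the logreg model, right? Could you elaborate further?
@G.Macia The model coefficients are obtained after training, so the threshold can only be worked out after training. You cannot choose a set of thresholds before training because it is not a hyperparameter of the model. Don't forget to accept the best answer :)
Hyperparameters are usually defined before training, not after.
@G.Macia To put it more clearly, the coefficients are the result of training for a given set of hyperparameters (e.g. the estimator's penalty). You are not going to select your features during training.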
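One way to reconcile the comments with the second answer is to separate the threshold rule from the numeric cutoff. The sketch below (toy data from the first answer, with two arbitrary values of C chosen only for illustration) fixes the rule "1.25*mean" before training and shows that the resulting threshold_ still depends on the fitted coefficients.

```python
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

# Toy data copied from the example in the first answer.
X = [[ 0.87, -1.34,  0.31],
     [-2.79, -0.02, -0.85],
     [-1.34, -0.48, -2.55],
     [ 1.92,  1.48,  0.65]]
y = [0, 1, 0, 1]

rules = {}
thresholds = {}
for C in (0.1, 10.0):  # arbitrary regularization strengths for illustration
    selector = SelectFromModel(LogisticRegression(C=C), threshold="1.25*mean")
    # threshold is the rule, known before any training happens.
    rules[C] = selector.threshold
    selector.fit(X, y)
    # threshold_ is the number that the rule produces from the fitted coefficients.
    thresholds[C] = selector.threshold_
    print(C, rules[C], thresholds[C])
```

So the string passed as threshold can be searched over like any other hyperparameter, while the number it resolves to is only known once the logreg model has been fitted.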