Python scikit-learn: reducing features with RFECV and GridSearch. Where are the coefficients stored?

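I am using scikit-learn's RFECV to select the most important features for a logistic regression, using cross-validation. Assume X is an [n, x] DataFrame of features and y is the response variable: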

from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn import preprocessing
from sklearn.feature_selection import RFECV
import sklearn
import sklearn.linear_model as lm
import sklearn.grid_search as gs
import numpy as np  # needed for np.arange below

#  Create a logistic regression estimator 
logreg = lm.LogisticRegression()

# Use RFECV to pick best features, using Stratified Kfold
rfecv =   RFECV(estimator=logreg, cv=StratifiedKFold(y, 3), scoring='roc_auc')

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)

pipe = make_pipeline(preprocessing.StandardScaler(), lm.LogisticRegression())

# Define a range of hyper parameters for grid search
C_range = 10.**np.arange(-5, 1)
penalty_options = ['l1', 'l2']

skf = StratifiedKFold(y, 3)
param_grid = dict(logisticregression__C=C_range,  logisticregression__penalty=penalty_options)

grid = GridSearchCV(pipe, param_grid, cv=skf, scoring='roc_auc')

grid.fit(X_new, y) 

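a) Is this the correct process for feature selection, hyperparameter selection, and fitting?

b) Where can I find the fitted coefficients for the selected features?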

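Is this the correct process for feature selection? It is one of many feature selection methods. Recursive feature elimination is an automated approach. Each method has its own strengths and weaknesses, and feature selection is usually best done by applying common sense and trying models with different features. RFE is a quick way to select a good set of features, but it will not necessarily give you the ultimately best ones. By the way, you do not need to build the StratifiedKFold separately. If you just pass an integer such as cv=3, both RFECV and GridSearchCV use stratified folds automatically when the estimator is a classifier and y is binary or multiclass. You can also combine the following two steps into a single call, X_new = rfecv.fit_transform(X, y):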

# Fit the features to the response variable
rfecv.fit(X, y)

# Put the best features into new df X_new
X_new = rfecv.transform(X)
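As a rough illustration of both suggestions (an integer cv and a combined fit_transform), here is a minimal sketch. It assumes a recent scikit-learn release where the cross-validation tools live in sklearn.model_selection, and it uses a small synthetic dataset in place of the asker's X and y; the dataset sizes and max_iter value are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the asker's data (assumption, not the real X and y)
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# cv=3 is enough: for a classifier with binary y, folds are stratified
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000),
              cv=3, scoring="roc_auc")

# Fit the selector and reduce X to the selected columns in one call
X_new = rfecv.fit_transform(X, y)

print(X_new.shape)        # (n_samples, number of selected features)
print(rfecv.n_features_)  # how many features RFECV kept
print(rfecv.support_)     # boolean mask over the original columns
```

The boolean mask in rfecv.support_ is what lets you map the columns of X_new back to the original feature names later.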

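Is this the correct process for selecting hyperparameters? GridSearchCV is essentially an automated way to systematically try a whole set of model parameter combinations and pick the best one according to some performance metric. So yes, it is a good way to find suitable parameters.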
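To make "a whole set of parameter combinations" concrete, the sketch below enumerates the candidates that a grid like the one in the question would produce. It uses ParameterGrid from sklearn.model_selection (assuming a recent scikit-learn); the variable name candidates is just for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the question: 6 values of C times 2 penalties
param_grid = {
    "logisticregression__C": 10.0 ** np.arange(-5, 1),
    "logisticregression__penalty": ["l1", "l2"],
}

# ParameterGrid yields one dict per combination; GridSearchCV
# fits and cross-validates the estimator once for each of them
candidates = list(ParameterGrid(param_grid))

print(len(candidates))
for params in candidates[:3]:
    print(params)
```

With 6 values of C and 2 penalties there are 12 candidates, and each one is evaluated on every cross-validation fold, so the total number of fits grows quickly as the grid gets larger.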

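Is this the correct process for fitting? Yes, this is a valid way to fit the model. When you call grid.fit(X_new, y), the grid search fits one model per parameter combination and keeps the best one. Because the estimator here is a pipeline, the coefficients and intercept live on its final step:

grid.best_estimator_
grid.best_estimator_.named_steps['logisticregression'].coef_
grid.best_estimator_.named_steps['logisticregression'].intercept_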
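For question b), a minimal sketch of pulling the coefficients out and pairing them with the features that RFECV kept might look like this. It assumes a recent scikit-learn (sklearn.model_selection), synthetic data in place of the asker's X and y, and a grid over C only; the names selected and coef_by_feature are hypothetical helpers introduced for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the asker's data (assumption)
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=3, random_state=0)

# Step 1: feature selection
rfecv = RFECV(LogisticRegression(max_iter=1000), cv=3, scoring="roc_auc")
X_new = rfecv.fit_transform(X, y)

# Step 2: hyperparameter search on the reduced feature set
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
grid = GridSearchCV(pipe, {"logisticregression__C": [0.01, 0.1, 1.0]},
                    cv=3, scoring="roc_auc")
grid.fit(X_new, y)

# The best pipeline's final step holds the fitted coefficients
logreg = grid.best_estimator_.named_steps["logisticregression"]

# Map each coefficient back to its original column index via the RFECV mask
selected = np.flatnonzero(rfecv.support_)
coef_by_feature = dict(zip(selected.tolist(), logreg.coef_.ravel().tolist()))

print(coef_by_feature)
print(logreg.intercept_)
```

Note that the coefficients are on the standardized scale, because StandardScaler runs before the logistic regression inside the pipeline.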


from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
import pandas as pd
import numpy as np

# simulate some artificial data so that I can show you the result of each intermediate step
# 1000 obs, X dim 1000-by-100, 2 different y labels with unbalanced weights
X, y = make_classification(n_samples=1000, n_features=100, n_informative=5, n_classes=2, weights=[0.1, 0.9])

X.shape
Out[78]: (1000, 100)

y.shape
Out[79]: (1000,)

# Nested cross-validation: this returns a train/test index iterator
split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)
# to take a look at the split, you will see it has 5 tuples
# the 1st fold
train_index = list(split)[0][0]
train_index
Out[80]: array([  0,   1,   2, ..., 997, 998, 999])

test_index = list(split)[0][1]
test_index
Out[81]: array([  5,  12,  17, ..., 979, 982, 984])

# let's play with just one iteration for now
# your pipe
pipe = make_pipeline(StandardScaler(), LogisticRegression())

# set up params
params_space = dict(logisticregression__C=10.0**np.arange(-5,1),
                    logisticregression__penalty=['l1', 'l2'],
                    logisticregression__class_weight=[None, 'auto'])

# apply your grid search only on the train data but with a further cv step
# so original train set has [gscv_train, gscv_validation] where the latter is used to tune hyperparameters
# all performance is still evaluated in a separated held-out 'test' set
grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')
# fit the data on train set
grid.fit(X[train_index], y[train_index])

# to get the params of your estimator, call your gscv
grid.best_estimator_
Pipeline(steps=[('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('logisticregression', LogisticRegression(C=0.10000000000000001, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', penalty='l1', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0))])

# the performance in validation set
grid.grid_scores_
[mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
 mean: 0.87975, std: 0.01753, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
 mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
 mean: 0.87985, std: 0.01746, params: {'logisticregression__C': 1.0000000000000001e-05, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
 mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
 mean: 0.88033, std: 0.01707, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
 mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
 mean: 0.87975, std: 0.01732, params: {'logisticregression__C': 0.0001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
 mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
 mean: 0.88245, std: 0.01732, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
 mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
 mean: 0.87955, std: 0.01686, params: {'logisticregression__C': 0.001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
 mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
 mean: 0.88746, std: 0.02318, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
 mean: 0.50000, std: 0.00000, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
 mean: 0.87990, std: 0.01634, params: {'logisticregression__C': 0.01, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
 mean: 0.94002, std: 0.02959, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
 mean: 0.87419, std: 0.02174, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
 mean: 0.93508, std: 0.03101, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
 mean: 0.87091, std: 0.01860, params: {'logisticregression__C': 0.10000000000000001, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'},
 mean: 0.88013, std: 0.03246, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l1'},
 mean: 0.85247, std: 0.02712, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': None, 'logisticregression__penalty': 'l2'},
 mean: 0.88904, std: 0.02906, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l1'},
 mean: 0.85197, std: 0.02097, params: {'logisticregression__C': 1.0, 'logisticregression__class_weight': 'auto', 'logisticregression__penalty': 'l2'}]

# or the best score among them
grid.best_score_
Out[84]: 0.94002188482393367

# now that the estimator is trained, we predict on the test set
y_pred = grid.predict(X[test_index])
# since LogisticRegression is a probability-based model, we have the luxury of getting the probability for each obs
y_pred_probs = grid.predict_proba(X[test_index])

array([[ 0.0632,  0.9368],
       [ 0.0236,  0.9764],
       [ 0.0227,  0.9773],
       [ 0.0108,  0.9892],
       [ 0.2903,  0.7097],
       [ 0.0113,  0.9887]])

# to get evaluation result, 
print(classification_report(y[test_index], y_pred))

             precision    recall  f1-score   support

          0       0.93      0.59      0.72        22
          1       0.95      0.99      0.97       179

avg / total       0.95      0.95      0.95       201

# to put all things together with the nested cross-validation
# generate a pandas dataframe to store prediction probability
kfold_df = pd.DataFrame(0.0, index=np.arange(len(y)), columns=np.unique(y))
report = []  # to store classification reports

split = StratifiedKFold(y, n_folds=5, shuffle=True, random_state=1)

for train_index, test_index in split:

    grid = GridSearchCV(pipe, params_space, cv=StratifiedKFold(y[train_index], n_folds=3), scoring='roc_auc')

    grid.fit(X[train_index], y[train_index])

    y_pred_probs = grid.predict_proba(X[test_index])
    kfold_df.iloc[test_index, :] = y_pred_probs

    y_pred = grid.predict(X[test_index])
    report.append(classification_report(y[test_index], y_pred))

# your result
kfold_df

          0       1
0    0.1710  0.8290
1    0.0083  0.9917
2    0.2049  0.7951
3    0.0038  0.9962
4    0.0536  0.9464
5    0.0632  0.9368
6    0.1243  0.8757
7    0.1150  0.8850
8    0.0796  0.9204
9    0.4096  0.5904
..      ...     ...
990  0.0505  0.9495
991  0.2128  0.7872
992  0.0270  0.9730
993  0.0434  0.9566
994  0.8078  0.1922
995  0.1452  0.8548
996  0.1372  0.8628
997  0.0127  0.9873
998  0.0935  0.9065
999  0.0065  0.9935

[1000 rows x 2 columns]

for r in report:
    print(r)

             precision    recall  f1-score   support

          0       0.93      0.59      0.72        22
          1       0.95      0.99      0.97       179

avg / total       0.95      0.95      0.95       201

             precision    recall  f1-score   support

          0       0.86      0.55      0.67        22
          1       0.95      0.99      0.97       179

avg / total       0.94      0.94      0.93       201

             precision    recall  f1-score   support

          0       0.89      0.38      0.53        21
          1       0.93      0.99      0.96       179

avg / total       0.93      0.93      0.92       200

             precision    recall  f1-score   support

          0       0.88      0.33      0.48        21
          1       0.93      0.99      0.96       178

avg / total       0.92      0.92      0.91       199

             precision    recall  f1-score   support

          0       0.88      0.33      0.48        21
          1       0.93      0.99      0.96       178

avg / total       0.92      0.92      0.91       199
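The walkthrough above uses the older sklearn.cross_validation and sklearn.grid_search modules. On recent scikit-learn releases the same nested cross-validation idea can be sketched with sklearn.model_selection, as below. This is an approximation under stated assumptions, not a line-for-line port: the dataset is smaller, only C is tuned, and cross_val_score replaces the manual outer loop.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Smaller unbalanced synthetic dataset (assumption, chosen to run quickly)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=2, weights=[0.1, 0.9], random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
params_space = {"logisticregression__C": 10.0 ** np.arange(-3, 1)}

# Inner loop tunes hyperparameters; outer loop estimates performance
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

grid = GridSearchCV(pipe, params_space, cv=inner_cv, scoring="roc_auc")

# Each outer fold refits the whole grid search on its own training part,
# so the held-out part never influences the chosen hyperparameters
scores = cross_val_score(grid, X, y, cv=outer_cv, scoring="roc_auc")

print(scores)
print(scores.mean())
```

In the newer API the splitter objects no longer take y in the constructor; the labels are passed when the split is made, which is why StratifiedKFold is built from n_splits alone here.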
