Scikit-learn: feature selection and prediction


I have X and Y data.

from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris

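I want to implement RFECV feature selection and prediction with k-fold validation:

X_new = rfecv.transform(X)
print(X_new.shape)

y_predicts = cross_val_predict(clf, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print('accuracy =', accuracy)

Code corrected according to the answer.

Edit (for the remaining small part):

Do not wrap StandardScaler and RFECV in the same pipeline. Instead, wrap StandardScaler and RandomForestClassifier in a pipeline and pass that pipeline to RFECV as the estimator. That way no training information is leaked.

estimators = [('standardize', StandardScaler()),
              ('clf', RandomForestClassifier())]

pipeline = Pipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)

Update: about the error

"RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes"

Yes, this is a known issue with scikit-learn pipelines. You can see my other answer for more details and use the new pipeline I created there.

Define a custom pipeline as shown below:

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_

And use it:

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy')
rfecv_data = rfecv.fit(X, Y)

Update 2: @brute, for your data and code, the algorithm finishes within a minute on my PC. The full code I used is the listing further down that begins with import numpy as np.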
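As a possible alternative to subclassing Pipeline, newer scikit-learn releases (0.24 and later, as far as I know) let RFECV read the importances from a named pipeline step through its importance_getter parameter. The sketch below is not from the original answer: it assumes such a version is installed, uses load_iris as stand-in data, and the n_estimators, random_state and StratifiedKFold settings are arbitrary choices to keep it quick and repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data (assumption): the thread's own X and Y are not available here
X, Y = load_iris(return_X_y=True)

# Scaler and classifier live in one pipeline, so scaling is refit inside every CV fold
pipeline = Pipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=50, random_state=0)),
])

rfecv = RFECV(
    estimator=pipeline,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
    scoring='accuracy',
    # Tell RFECV where the importances are inside the pipeline (step named 'clf')
    importance_getter='named_steps.clf.feature_importances_',
)
rfecv.fit(X, Y)

print('no. of selected features =', rfecv.n_features_)
print('ranking =', rfecv.ranking_)
```

With this, a plain Pipeline should be enough and the "does not expose coef_ or feature_importances_" error should not come up. On older versions without importance_getter, the Mypipeline subclass remains the way to go.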

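Update 3: for cross_val_predict, pass pipeline instead of clf, as the comments near the end of the full listing explain.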

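Here is how we would do it: fit on the training set, then predict on the test set. The code is the last listing below, the one that uses load_iris and train_test_split.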

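No. This code does not use StandardScaler anywhere. You only define it in the pipeline, but it is never used (never fitted). When you call pipeline.named_steps['rfecv'].fit(X_train, y_train), rfecv is applied directly to the raw data, not the scaled data.

@brute. Code updated. This now uses the pipeline correctly to scale the data and fit the RFE.

@VivekKumar. Please define leaked data. You are wrong. I don't really care how rfecv handles the training data x_train. What matters here is that we first split the dataset into a training set and a test set with train_test_split. We fit rfecv on the train set and predict on the test set. No data from the test set leaks into the train set. Don't mix up the operations.

That is exactly the problem. RFECV splits X_train again into train and test parts (using the cv folds), and the data has already been scaled before that split. So the model trained on RFECV's inner train data already knows the scale of the inner test data, because the inner test data was used for the scaling (here I mean the inner train and test). The features you then find to be important are biased by this.

This is an overly complicated and unnecessary hack.

@EkabaBisong It may be a bit complicated, but it is not unnecessary. It is done to prevent data leakage.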
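To make the leakage argument in the comments concrete, here is a small sketch (not from the thread) that compares a scaler fitted once on all of X_train with scalers fitted only on each inner training fold. The load_iris data, the random_state values and the name max_shift are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, train_test_split
from sklearn.preprocessing import StandardScaler

X, Y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)

# Scaler fitted once on all of X_train: this is what happens when
# StandardScaler sits in front of RFECV in the same pipeline
outer_scaler = StandardScaler().fit(X_train)

# The kind of inner split RFECV performs on X_train
kf = KFold(n_splits=10, shuffle=True, random_state=0)

max_shift = 0.0
for inner_train_idx, inner_val_idx in kf.split(X_train):
    # Scaler that has seen only the inner training rows (the leak-free case)
    inner_scaler = StandardScaler().fit(X_train[inner_train_idx])
    shift = np.abs(inner_scaler.mean_ - outer_scaler.mean_).max()
    max_shift = max(max_shift, shift)

print('largest difference in feature means:', max_shift)
```

Any nonzero difference shows that the scaler fitted up front carries information from the inner validation rows. On a dataset this small the effect on the selected features may well be negligible, but the point being argued is that the inner validation data is not fully unseen.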

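# Full code from Update 2: load the data, select features, then cross-validate.
import numpy as np
import glob
from sklearn.utils import resample

# Read every '|'-delimited file, drop rows containing NaN, keep 1800 rows per file
files = glob.glob('/home/Downloads/Untitled Folder/*')
outs = []
for fi in files:
    data = np.genfromtxt(fi, delimiter='|', dtype=float)
    data = data[~np.isnan(data).any(axis=1)]
    data = resample(data, replace=False, n_samples=1800, random_state=0)
    outs.append(data)

X = np.vstack(outs)
print(X.shape)
# Labels 1..10, each repeated 1800 times (this assumes 10 input files)
Y = np.repeat([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 1800)
print(Y.shape)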

#from sklearn.utils import shuffle
#X, Y = shuffle(X, Y, random_state=0)

from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

clf = RandomForestClassifier()

kf = KFold(n_splits=10, shuffle=True, random_state=0)  

estimators = [('standardize' , StandardScaler()),
              ('clf', RandomForestClassifier())]

class Mypipeline(Pipeline):
    @property
    def coef_(self):
        return self._final_estimator.coef_
    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_ 

pipeline = Mypipeline(estimators)

rfecv = RFECV(estimator=pipeline, scoring='accuracy', verbose=10)
rfecv_data = rfecv.fit(X, Y)

print ('no. of selected features =', rfecv_data.n_features_) 

X_new = rfecv.transform(X)
print(X_new.shape)

# Here change clf to pipeline, 
# because RFECV has found features according to scaled data,
# which is not present when you pass clf 
y_predicts = cross_val_predict(pipeline, X_new, Y, cv=kf)
accuracy = accuracy_score(Y, y_predicts)
print ('accuracy =', accuracy)
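The data files used in the listing above are not available, so here is a self-contained stand-in on load_iris. It is a variation rather than the answer's exact code: the RFECV object itself is passed to cross_val_predict, so that feature selection is redone inside each outer fold instead of once on all of X. The small n_estimators, cv=3 and n_splits=5 are arbitrary choices to keep the run short.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class Mypipeline(Pipeline):
    """Pipeline that exposes the final step's importances, as in the answer."""
    @property
    def coef_(self):
        return self._final_estimator.coef_

    @property
    def feature_importances_(self):
        return self._final_estimator.feature_importances_


# Stand-in data (assumption): replaces the '|'-delimited files of the listing
X, Y = load_iris(return_X_y=True)

kf = KFold(n_splits=5, shuffle=True, random_state=0)

pipeline = Mypipeline([
    ('standardize', StandardScaler()),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=0)),
])

# Inner CV (cv=3) picks the features; the outer folds (kf) produce the predictions
rfecv = RFECV(estimator=pipeline, cv=3, scoring='accuracy')
y_predicts = cross_val_predict(rfecv, X, Y, cv=kf)

accuracy = accuracy_score(Y, y_predicts)
print('accuracy =', accuracy)
```

This costs more compute than selecting the features once and then calling cross_val_predict on X_new, but the reported accuracy then comes from rows that took no part in choosing the features.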

from sklearn.feature_selection import RFECV
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict, KFold
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

data = load_iris()
X, Y = data.data, data.target

# split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, shuffle=True)

# create model
clf = RandomForestClassifier()    
# instantiate K-Fold
kf = KFold(n_splits=10, shuffle=True, random_state=0)

# pipeline estimators
estimators = [('standardize' , StandardScaler()),
             ('rfecv', RFECV(estimator=clf, cv=kf, scoring='accuracy'))]

# instantiate pipeline
pipeline = Pipeline(estimators)    
# fit rfecv to train model
rfecv_model = pipeline.fit(X_train, y_train)

# print number of selected features
print ('no. of selected features =', pipeline.named_steps['rfecv'].n_features_)
# print feature ranking
print ('ranking =', pipeline.named_steps['rfecv'].ranking_)

# Output:
# no. of selected features = 3
# ranking = [1 2 1 1]

# make predictions on the test set
predictions = rfecv_model.predict(X_test)

# evaluate the model performance using accuracy metric
print("Accuracy on test set: ", accuracy_score(y_test, predictions))

# Output (as reported):
# Accuracy:  0.9736842105263158