Python: Recursive feature elimination on a random forest using scikit-learn
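I'm trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursive process.
However, when I try to use the RFECV method, I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'
Random forests don't have coefficients per se, but they do rank features by Gini score. So I'm wondering how to get around this problem.
Please note that I want to use a method that will explicitly tell me which features from my pandas DataFrame were selected in the optimal grouping, as I am using recursive feature selection to try to minimize the amount of data I will input into the final classifier.
Here's some example code: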
from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pd.Series(iris.target, name='target')
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=10, scoring='ROC', verbose=2)
selector=rfecv.fit(x, y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 336, in fit
    ranking_ = rfe.fit(X_train, y_train).ranking_
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 148, in fit
    if estimator.coef_.ndim > 1:
AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'
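For readers following along, here is a minimal sketch of the two ingredients the question mentions: the Gini-based ranking (`feature_importances_`) and an out-of-bag ROC AUC. It is an illustration rather than the asker's code; the binary target (virginica vs. the rest), `n_estimators=200` and `random_state=0` are my own assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

iris = load_iris()
X = iris.data
# Assumption for illustration: a binary target (class 2 vs. the rest),
# so a plain ROC AUC is well defined.
y = (iris.target == 2).astype(int)

# oob_score=True makes the forest keep out-of-bag predictions for each sample
rf = RandomForestClassifier(n_estimators=200, min_samples_leaf=5,
                            oob_score=True, random_state=0)
rf.fit(X, y)

# No coef_ here, but impurity-based (Gini) importances are available after fit
importances = rf.feature_importances_
print("importances:", np.round(importances, 3))

# oob_decision_function_ has shape (n_samples, n_classes); column 1 is P(class 1)
oob_proba = rf.oob_decision_function_[:, 1]
oob_auc = roc_auc_score(y, oob_proba)
print("OOB ROC AUC: %.3f" % oob_auc)
```

The OOB score comes for free from bagging, so it needs no separate validation split; RFECV, however, scores subsets by cross-validation, not by OOB predictions.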
Here is my code; I've tidied it up a bit to make it relevant to your task:
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, log_loss

# NOTE: fea_cols, train, test, status_labels, priors, private_priors and the
# helper module kf come from the rest of my project and are not shown here
features_to_use = fea_cols # this is a list of features
# empty dataframe
trim_5_df = DataFrame(columns=features_to_use)
run=1
# this will remove the 5 worst features determined by their feature importance computed by the RF classifier
while len(features_to_use)>6:
    print('number of features:%d' % (len(features_to_use)))
    # build the classifier
    clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
    # train the classifier
    clf.fit(train[features_to_use], train['OpenStatusMod'].values)
    # score on the training data (labels must come from the same frame as the features)
    print('classifier score: %f\n' % clf.score(train[features_to_use], train['OpenStatusMod'].values))
    # predict the class and print the classification report, f1 micro, f1 macro score
    pred = clf.predict(test[features_to_use])
    print(classification_report(test['OpenStatusMod'].values, pred, target_names=status_labels))
    print('micro score: ')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='micro'))
    print('macro score:\n')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='macro'))
    # predict the class probabilities
    probs = clf.predict_proba(test[features_to_use])
    # rescale the priors
    new_probs = kf.cap_and_update_priors(priors, probs, private_priors, 0.001)
    # calculate logloss with the rescaled probabilities
    print('log loss: %f\n' % log_loss(test['OpenStatusMod'].values, new_probs))
    row={}
    if hasattr(clf, "feature_importances_"):
        # sort the features by importance
        sorted_idx = np.argsort(clf.feature_importances_)
        # reverse the order so it is descending
        sorted_idx = sorted_idx[::-1]
        # add to dataframe
        row['num_features'] = len(features_to_use)
        row['features_used'] = ','.join(features_to_use)
        # trim the worst 5
        sorted_idx = sorted_idx[: -5]
        # swap the features list with the trimmed features
        temp = features_to_use
        features_to_use=[]
        for feat in sorted_idx:
            features_to_use.append(temp[feat])
        # add the logloss performance
        row['logloss']=[log_loss(test['OpenStatusMod'].values, new_probs)]
    print('')
    # add the row to the dataframe (DataFrame.append was removed in pandas 2.0)
    trim_5_df = pd.concat([trim_5_df, DataFrame(row)], ignore_index=True)
    run +=1
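So what I'm doing here is: I have a list of features that I train on, then predict with, then I use the feature importances to trim the worst 5, and repeat. During each run I add a row recording the prediction performance so that I can do some analysis later.
The original code was much bigger, as I had different classifiers and datasets to analyse, but I hope you get the picture from the above. What I noticed is that for random forest the number of features removed on each run affected the performance, so trimming by 1, 3 and 5 features at a time resulted in a different set of best features.
I found that using a GradientBoostingClassifier was more predictable and repeatable, in the sense that the final set of best features agreed whether I trimmed 1 feature at a time or 3 or 5.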
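To make the step-size effect easy to try out, here is a small self-contained sketch of the same train / rank / trim loop with a configurable `step`. It is not my original code: the synthetic data from `make_classification`, the function name `backward_eliminate`, the 3-fold cross-validated accuracy and the `min_features=2` stopping point are all assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def backward_eliminate(X, y, names, step=1, min_features=2, seed=0):
    """Drop the `step` least important features per round.

    Returns a list of (feature_names, mean_cv_accuracy), one entry per round.
    """
    names = list(names)
    cols = list(range(X.shape[1]))  # column indices still in play
    history = []
    while True:
        clf = RandomForestClassifier(n_estimators=50, random_state=seed)
        score = cross_val_score(clf, X[:, cols], y, cv=3).mean()
        history.append(([names[c] for c in cols], score))
        # never trim below min_features
        n_drop = min(step, len(cols) - min_features)
        if n_drop <= 0:
            break
        clf.fit(X[:, cols], y)
        # argsort is ascending, so the first n_drop positions are the worst
        order = np.argsort(clf.feature_importances_)
        drop = set(order[:n_drop].tolist())
        cols = [c for i, c in enumerate(cols) if i not in drop]
    return history


X, y = make_classification(n_samples=200, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)
names = ['f%d' % i for i in range(10)]

histories = {}
for step in (1, 3):
    histories[step] = backward_eliminate(X, y, names, step=step)
    best_features, best_score = max(histories[step], key=lambda r: r[1])
    print("step=%d best subset=%s score=%.3f" % (step, best_features, best_score))
```

Comparing the best subset printed for each `step` is a quick way to check whether your own data shows the instability I described; swapping in `GradientBoostingClassifier` only requires changing the estimator line.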
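I hope I'm not teaching you to suck eggs here, you probably know more than I do, but my approach to ablation analysis is to use a fast classifier to get a rough idea of the best sets of features, then use a better performing classifier, then start hyperparameter tuning, again doing coarse-grained comparisons first and then fine-grained ones once I get a feel for what the best parameters are.

Here's what I've come up with. It's a pretty simple solution, and it relies on a custom accuracy metric (called weightedAccuracy) since I'm classifying a highly unbalanced dataset. But it should be easy to make more extensible if desired.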
from sklearn import datasets
import pandas
from sklearn.ensemble import RandomForestClassifier
# train_test_split now lives in model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix

def get_enhanced_confusion_matrix(actuals, predictions, labels):
    """enhances confusion_matrix by adding sensitivity and specificity metrics"""
    cm = confusion_matrix(actuals, predictions, labels = labels)
    sensitivity = float(cm[1][1]) / float(cm[1][0]+cm[1][1])
    specificity = float(cm[0][0]) / float(cm[0][0]+cm[0][1])
    weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
    return cm, sensitivity, specificity, weightedAccuracy

iris = datasets.load_iris()
x=pandas.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pandas.Series(iris.target, name='target')
response, _ = pandas.factorize(y)

xTrain, xTest, yTrain, yTest = train_test_split(x, response, test_size = .25, random_state = 36583)
print("building the first forest")
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, n_jobs = -1, verbose = 1)
rf.fit(xTrain, yTrain)
# rank the features once, most important first (DataFrame.sort is now sort_values)
importances = pandas.DataFrame({'name':x.columns,'imp':rf.feature_importances_
                                }).sort_values(['imp'], ascending = False).reset_index(drop = True)

cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
numFeatures = len(x.columns)

# one row per feature count, starting with the full feature set
rfeMatrix = pandas.DataFrame({'numFeatures':[numFeatures],
                              'weightedAccuracy':[weightedAccuracy],
                              'sensitivity':[sensitivity],
                              'specificity':[specificity]})

print("running RFE on %d features"%numFeatures)

# refit on the top-i features for i = 1 .. numFeatures-1
for i in range(1,numFeatures,1):
    varsUsed = importances['name'][0:i]
    print("now using %d of %s features"%(len(varsUsed), numFeatures))
    xTrain, xTest, yTrain, yTest = train_test_split(x[varsUsed], response, test_size = .25)
    rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2,
                                n_jobs = -1, verbose = 1)
    rf.fit(xTrain, yTrain)
    cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
    print("\n"+str(cm))
    print('the sensitivity is %d percent'%(sensitivity * 100))
    print('the specificity is %d percent'%(specificity * 100))
    print('the weighted accuracy is %d percent'%(weightedAccuracy * 100))
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    rfeMatrix = pandas.concat([rfeMatrix,
                               pandas.DataFrame({'numFeatures':[len(varsUsed)],
                                                 'weightedAccuracy':[weightedAccuracy],
                                                 'sensitivity':[sensitivity],
                                                 'specificity':[specificity]})], ignore_index = True)
print("\n"+str(rfeMatrix))
# pick the smallest feature count that reaches the best weighted accuracy
maxAccuracy = rfeMatrix.weightedAccuracy.max()
maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()

print("the final features used are %s"%featuresUsed)
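As a worked check of the weightedAccuracy formula (0.9 × sensitivity + 0.1 × specificity), here is a tiny sketch on hand-made labels. The function name `weighted_accuracy`, the `w_sens` parameter and the toy arrays are assumptions for illustration; wrapping it with `make_scorer` is one possible way to reuse the metric in cross-validation, not something the code above does.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, make_scorer


def weighted_accuracy(y_true, y_pred, w_sens=0.9):
    """Blend sensitivity and specificity, weighting sensitivity by w_sens."""
    # For binary labels [0, 1], ravel() yields tn, fp, fn, tp in that order
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    return w_sens * sensitivity + (1 - w_sens) * specificity


# Toy, unbalanced example: 8 negatives, 2 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])

score = weighted_accuracy(y_true, y_pred)
plain_accuracy = (y_true == y_pred).mean()
print("weighted accuracy: %.3f, plain accuracy: %.3f" % (score, plain_accuracy))

# Optional: a scorer object that can be passed as scoring=... to sklearn tools
weighted_scorer = make_scorer(weighted_accuracy)
```

Here sensitivity is 1/2 and specificity is 6/8, so the weighted score is 0.9 × 0.5 + 0.1 × 0.75 = 0.525, noticeably lower than the plain accuracy of 0.7 because the missed positive is penalised heavily.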
Here's what I did to adapt RandomForestClassifier to work with RFECV:
class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        # expose the importances under the attribute name RFECV looks for
        self.coef_ = self.feature_importances_
        return self
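Just using this class does the trick if you use the 'accuracy' or 'f1' score. For 'roc_auc', RFECV complains that the multiclass format is not supported. After changing it to a two-class classification with the code below, the 'roc_auc' scoring works. (Using Python 3.4.1 and scikit-learn 0.15.1)
Plugging into your code: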
from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        # expose the importances under the attribute name RFECV looks for
        self.coef_ = self.feature_importances_
        return self
iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=(pd.Series(iris.target, name='target')==2).astype(int)
rf = RandomForestClassifierWithCoef(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc', verbose=2)
selector=rfecv.fit(x, y)
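To answer the "which features were selected" part of the question, the fitted selector exposes `support_` (a boolean mask) and `ranking_`, which can be mapped back to the DataFrame columns. A minimal sketch follows; it assumes a scikit-learn release in which RFECV accepts a plain RandomForestClassifier (on older releases, substitute the RandomForestClassifierWithCoef class), and the smaller `n_estimators=100` and `random_state=0` are my choices to keep it quick and repeatable.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
y = (pd.Series(iris.target, name='target') == 2).astype(int)

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)
selector = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc').fit(x, y)

# support_ is a boolean mask aligned with the columns that were passed in
selected = x.columns[selector.support_].tolist()
# ranking_ is 1 for every selected feature; larger values were eliminated earlier
ranking = pd.Series(selector.ranking_, index=x.columns).sort_values()

# Reduced frame to feed into the final classifier
x_reduced = x[selected]

print("selected features:", selected)
print(ranking)
```

Indexing the DataFrame with the selected names (rather than using `selector.transform`, which by default returns a bare array) keeps the column labels attached to the reduced data.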
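I submitted a request to add coef_ so that RandomForestClassifier may be used with RFECV. However, the change had already been made. This change will be in version 0.17.
If you want to use it now, you can pull the latest development build.

Another approach is to use the feature_importances_ attribute once the estimator has been fitted (it is set by fit, so it is available by the time you call predict or predict_proba). It returns an array of relative importances (fractions that sum to 1) in the order the features were passed.

Saw that; however, I want to know if there is something that will let me do 10-fold validation and identify the optimal subset of features.
I had to do something similar, but I did it manually by sorting the feature importances and then trimming by 1, 3 or 5 features at a time. I didn't use your approach, I have to say, so I don't know if it can be done.
Could you share your manual approach?
I will post my code tomorrow morning; my code is on my work PC, so around 8am BST.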
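For the 10-fold question, here is a hedged sketch of what this can look like on a release that includes the change mentioned above, so a plain RandomForestClassifier is passed straight to RFECV. It assumes a recent scikit-learn that provides the 'roc_auc_ovr' scorer (which also handles the three-class iris target); the shuffled StratifiedKFold, `n_estimators=100` and `random_state=0` are my own choices for illustration.

```python
import pandas as pd
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

iris = datasets.load_iris()
x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
y = pd.Series(iris.target, name='target')  # all three classes kept

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=5, random_state=0)

# 10 stratified folds, so every test fold contains all three classes
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# 'roc_auc_ovr' averages one-vs-rest ROC AUC over the classes
rfecv = RFECV(estimator=rf, step=1, cv=cv, scoring='roc_auc_ovr',
              min_features_to_select=1)
rfecv.fit(x, y)

best = x.columns[rfecv.support_].tolist()
print("optimal number of features:", rfecv.n_features_)
print("optimal subset:", best)
```

RFECV runs the elimination inside each fold and then refits on the full data with the best-scoring number of features, which is the automated version of the manual sort-and-trim loop discussed in the comments.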