Scikit-learn: understanding the number of features RFECV selects based on grid scores only
From the scikit-learn documentation: the algorithm selects successively smaller sets of features, keeping only the features with the highest weights. Features with low weights are dropped, and the process repeats until the number of remaining features matches the number specified by the user (or, by default, half the original number of features).

RFECV, in turn, ranks features using RFE together with k-fold cross-validation (KFCV).

The code uses a set of 25 features. Here is the output I got:
Original number of features is 25
RFE final number of features : 12
RFECV final number of features : 3
Printing RFECV results:
1. Number of features: 3; Grid_Score: 0.818041
2. Number of features: 4; Grid_Score: 0.816065
3. Number of features: 5; Grid_Score: 0.816053
4. Number of features: 6; Grid_Score: 0.799107
5. Number of features: 7; Grid_Score: 0.797047
6. Number of features: 8; Grid_Score: 0.783034
7. Number of features: 10; Grid_Score: 0.783022
8. Number of features: 9; Grid_Score: 0.781992
9. Number of features: 11; Grid_Score: 0.778028
10. Number of features: 12; Grid_Score: 0.774052
11. Number of features: 14; Grid_Score: 0.762015
12. Number of features: 13; Grid_Score: 0.760075
13. Number of features: 15; Grid_Score: 0.752003
14. Number of features: 16; Grid_Score: 0.750015
15. Number of features: 18; Grid_Score: 0.750003
16. Number of features: 22; Grid_Score: 0.748039
17. Number of features: 17; Grid_Score: 0.746003
18. Number of features: 19; Grid_Score: 0.739105
19. Number of features: 20; Grid_Score: 0.739021
20. Number of features: 21; Grid_Score: 0.738003
21. Number of features: 23; Grid_Score: 0.729068
22. Number of features: 25; Grid_Score: 0.725056
23. Number of features: 24; Grid_Score: 0.725044
24. Number of features: 2; Grid_Score: 0.506952
25. Number of features: 1; Grid_Score: 0.272896
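The code that produced this output is not shown here. As a rough sketch only, a comparison of this kind could look like the following. It assumes a synthetic dataset from make_classification with 25 features, 3 of them informative, and a linear SVC as the estimator. The sample count, class count, fold count and random seed are illustrative choices, not values confirmed by the question, so the scores it produces need not match the listing above.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

# Assumed synthetic data: 25 features, only 3 of them informative.
X, y = make_classification(
    n_samples=1000,
    n_features=25,
    n_informative=3,
    n_redundant=2,
    n_repeated=0,
    n_classes=8,
    n_clusters_per_class=1,
    random_state=0,
)
print("Original number of features is %d" % X.shape[1])

# Plain RFE: with n_features_to_select left unset it keeps half the features.
rfe = RFE(estimator=SVC(kernel="linear"), step=1)
rfe.fit(X, y)
print("RFE final number of features : %d" % rfe.n_features_)

# RFECV: scores every candidate feature count by cross-validation
# and keeps the count with the best mean score.
rfecv = RFECV(
    estimator=SVC(kernel="linear"),
    step=1,
    cv=StratifiedKFold(n_splits=2),
    scoring="accuracy",
)
rfecv.fit(X, y)
print("RFECV final number of features : %d" % rfecv.n_features_)
```

In recent scikit-learn releases the per-size cross-validated scores are available as rfecv.cv_results_["mean_test_score"]. Older releases exposed them as grid_scores_, which is where the Grid_Score label in the output comes from.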
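In this particular example:

In RFECV, is cross-validation performed on the feature subset that remains after pruning? If so, how many features are kept after each pruning round in RFECV?

Answer:

In the cross-validated version, the features are re-ranked at every step and the lowest-ranked feature is dropped; the documentation calls this "recursive feature selection".

If you want to compare this with the original version, you need to compute the cross-validated score of the features selected by RFE. My guess is that the RFECV answer is the correct one. Judging by the sharp increase in model performance as features are removed, you probably have some highly correlated features that are hurting the model's performance.

Comments:

1. Is cross-validation done after the features are pruned, that is, on the truncated feature set? Also, why does it remove only one feature at a time? Is that a rule? 2. By correlated features, do you mean linearly correlated features, or can there be other kinds of correlation between features?

Yes, we should expect 3 features. The code I used is just the documentation example, where the goal is to classify using 3 informative features. But plain RFE clearly does not give that result: if we did not know the number of features in advance, we would run RFE and end up keeping the default number of features (12 in this case). It seems to me that only RFECV gives the right answer. Perhaps this is simply a major limitation of plain RFE, namely that it cannot eliminate correlated features, which is why the grid scores are needed here?

Yes, cross-validation is done after pruning. RFECV has a parameter named step that specifies how many features to remove at each step; it defaults to 1. I think it could be any kind of correlation, although I am not 100% sure how different kinds of correlation affect the model. As for your other question, cross-validation is basically the "modern" way of doing statistics. Heuristic methods such as standard RFE exist because the computing power was not available in the past. Now that computing is cheap, you should almost always prefer cross-validation (or some other train/test split) for evaluating model performance!

That answers my question.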
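The comparison suggested in the answer (computing a cross-validated score for the features plain RFE keeps) could be sketched roughly as below. This is only an illustration: the dataset, the linear SVC, the 2-fold split and the helper name cv_accuracy_with_rfe are assumptions, not the original poster's code. RFE is placed inside a pipeline so that feature selection is redone on each training fold rather than on the full data.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Assumed synthetic data: 25 features, only 3 of them informative.
X, y = make_classification(
    n_samples=1000,
    n_features=25,
    n_informative=3,
    n_redundant=2,
    n_repeated=0,
    n_classes=8,
    n_clusters_per_class=1,
    random_state=0,
)
cv = StratifiedKFold(n_splits=2)


def cv_accuracy_with_rfe(n_features_to_select, step=1):
    """Hypothetical helper: mean CV accuracy when RFE keeps a fixed feature count.

    `step` is the number of features RFE removes per iteration (default 1).
    """
    model = make_pipeline(
        RFE(
            SVC(kernel="linear"),
            n_features_to_select=n_features_to_select,
            step=step,
        ),
        SVC(kernel="linear"),
    )
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    return scores.mean()


# 12 is what plain RFE keeps by default for 25 features; 3 is what RFECV chose.
score_12 = cv_accuracy_with_rfe(12)
score_3 = cv_accuracy_with_rfe(3)
print("CV accuracy with 12 features: %.3f" % score_12)
print("CV accuracy with  3 features: %.3f" % score_3)
```

Passing a larger step (for example step=2) removes more features per iteration, which is faster but evaluates a coarser set of feature counts.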