R: Random forest optimization with tuning and cross-validation

r machine-learning

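I'm working with a large dataset, so I'd like to remove irrelevant variables and tune m, the number of variables tried at each split (mtry). In R, two functions, rfcv and tuneRF, help with these two tasks. I'm trying to combine them to optimize the parameters.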

rfcv works roughly as follows:

create random forest and extract each variable's importance;
while (nvar > 1) {
    remove the k (or k%) least important variables;
    run random forest with remaining variables, reporting cverror and predictions
}
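
For reference, a minimal sketch of calling rfcv directly, assuming the randomForest package is installed. The built-in iris data stands in for the large dataset, and the values of cv.fold and step are illustrative choices, not recommendations.

```r
library(randomForest)

set.seed(42)

# Stand-in data (assumption): four numeric predictors and a factor response.
x <- iris[, 1:4]
y <- iris$Species

# step = 0.5 keeps roughly half of the variables at each reduction step;
# cv.fold = 5 requests 5-fold cross-validation.
cv <- rfcv(trainx = x, trainy = y, cv.fold = 5, step = 0.5)

# Number of variables kept at each step and the matching CV error rate.
print(cv$n.var)
print(cv$error.cv)
```

By default rfcv sets mtry to floor(sqrt(p)) for each reduced variable set (via its mtry function argument) rather than tuning it, which is the gap the question is about.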

Currently, I have recoded rfcv to work as follows:

create random forest and extract each variable's importance;
while (nvar > 1) {
    remove the k (or k%) least important variables;
    tune for the best m for reduced variable set;
    run random forest with remaining variables, reporting cverror and predictions;
}
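
The modified loop above might look something like the following in R. This is only a sketch under several assumptions, not the asker's actual code: tune_mtry and rfcv_tuned are hypothetical names, the sketch handles classification only, folds are not stratified, and variables are ranked once per fold by Gini importance rather than re-ranked recursively.

```r
library(randomForest)

# Hypothetical helper: choose mtry by OOB error with tuneRF, falling back to
# the sqrt(p) default if tuning fails or only one variable remains.
tune_mtry <- function(x, y) {
  start <- max(1, floor(sqrt(ncol(x))))
  if (ncol(x) == 1) return(1)
  res <- tryCatch(
    tuneRF(x, y, mtryStart = start, ntreeTry = 100,
           trace = FALSE, plot = FALSE),
    error = function(e) NULL
  )
  if (is.null(res)) return(start)
  unname(res[which.min(res[, "OOBError"]), "mtry"])
}

# Hypothetical rfcv-like loop with an extra tuning step per variable subset.
rfcv_tuned <- function(x, y, cv.fold = 5, step = 0.5) {
  n <- nrow(x)
  # Subset sizes: p, then repeatedly scaled by `step`, down to 1.
  n_var <- ncol(x)
  while (tail(n_var, 1) > 1) {
    n_var <- c(n_var, max(1, floor(tail(n_var, 1) * step)))
  }
  fold <- sample(rep(seq_len(cv.fold), length.out = n))  # unstratified folds
  pred <- matrix(NA_character_, n, length(n_var))
  best <- matrix(NA_real_, cv.fold, length(n_var))

  for (f in seq_len(cv.fold)) {
    test <- fold == f
    # Rank variables on the training part of the fold only.
    full <- randomForest(x[!test, , drop = FALSE], y[!test])
    imp <- importance(full, type = 2)[, 1]
    ranked <- names(sort(imp, decreasing = TRUE))

    for (j in seq_along(n_var)) {
      keep <- ranked[seq_len(n_var[j])]
      xtr <- x[!test, keep, drop = FALSE]
      m <- tune_mtry(xtr, y[!test])            # the added tuning step
      fit <- randomForest(xtr, y[!test], mtry = m)
      pred[test, j] <- as.character(predict(fit, x[test, keep, drop = FALSE]))
      best[f, j] <- m
    }
  }

  error_cv <- colMeans(pred != as.character(y))
  names(error_cv) <- n_var
  list(n.var = n_var, error.cv = error_cv, mtry = best)
}

set.seed(42)
res <- rfcv_tuned(iris[, 1:4], iris$Species)
print(res$error.cv)
print(res$mtry)
```

Each call to tuneRF fits several forests, so the tuning step multiplies the cost of every inner iteration, which matches the runtime increase described in the question.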

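Of course, this increases the running time by an order of magnitude. My question is how necessary this is (it is hard to get a sense of it from toy datasets), and whether there is any other approach that would work roughly as well in far less time.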

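As always, the answer depends on the data. On the one hand, if there are no irrelevant features, you can skip feature elimination entirely. The tree-building process in the random forest implementation already tries to select predictive features, which gives you some protection against irrelevant ones.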

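In one of his talks, Leo Breiman described adding 1000 irrelevant features to some medical prediction tasks that had only a handful of real features from the input domain. When he eliminated 90% of the features using a single filter on variable importance, the next iteration of random forest did not select any of the irrelevant features as predictors in its trees.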
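
A small experiment in the same spirit can be run on your own data to see how much a single importance filter already buys you. The sketch below is not a reproduction of Breiman's experiment: the iris data, the 50 Gaussian noise columns, the 10% cutoff, and the use of Gini importance are all assumptions chosen for illustration.

```r
library(randomForest)

set.seed(1)

# Assumption: 50 pure-noise columns appended to the 4 real iris predictors.
n_noise <- 50
noise <- as.data.frame(matrix(rnorm(nrow(iris) * n_noise), nrow(iris)))
names(noise) <- paste0("noise", seq_len(n_noise))
x <- cbind(iris[, 1:4], noise)
y <- iris$Species

# First pass: fit on everything and rank variables by Gini importance.
first <- randomForest(x, y, ntree = 500)
imp <- importance(first, type = 2)[, 1]

# Single filter: keep the top 10% of variables, drop the other 90%.
keep <- names(sort(imp, decreasing = TRUE))[seq_len(ceiling(0.1 * ncol(x)))]

# Second pass: refit on the survivors and count how often each is used
# as a split variable across the forest.
second <- randomForest(x[, keep, drop = FALSE], y, ntree = 500)
used <- varUsed(second)
names(used) <- keep

print(keep)
print(used)
```

If the surviving variables are mostly the real ones and the remaining noise columns are rarely used in splits, a single filtering pass with default mtry may be enough, and the full tuned loop is probably unnecessary for that dataset.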