R: Can I use random forest for feature selection with caret? [r, random-forest, r-caret, feature-selection]


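I am using the caret R package for model training, and I am completely new to machine learning.

I don't know whether the idea below is a sound way to do feature selection and model training.

The idea is this: first I run a random forest through the train function, then I select the top 20 most important features (based on the varImp function), and finally I retrain the model on only those 20 features.

Is this approach valid? My code is shown below.

1. First, I train a random forest model with all features: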

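library(caret)   # train(), trainControl(), varImp(), confusionMatrix()
library(dplyr)   # %>% and filter()

ctrl <- trainControl(method = "cv",
                     number = 5,
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE)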

##################################### train model 
set.seed(1234)
model <- 
  train(Bgroup ~ ., 
        data=all_data, 
        method="rf", preProc=c("center", "scale","nzv"), 
        trControl=ctrl,
        tuneLength = 10,
        metric = "ROC"
  )

#####################################  based on this model, I can get a plot of feature importance 

feature_importance <- varImp(model)


##################################### I only selected top20 important features 
importance_df <- data.frame(feature_importance$importance,
                            feature = rownames(feature_importance$importance))
top20 <- head(importance_df[order(importance_df[, 1], decreasing = TRUE), ], n = 20) %>%
  .$feature %>%
  gsub("`", "", .)


##################################### top20 data selection
data_top20 <- all_data[,top20] 
data_top20$group <- all_data$Bgroup

##################################### re-train model again based on these 20 features 
set.seed(1234)
model_top20 <- train(group ~ .,
                     data = data_top20,
                     method = "rf", preProc = c("center", "scale", "nzv"),
                     trControl = ctrl,
                     tuneLength = 10,
                     metric = "ROC"
)

### calculate performance from the held-out CV predictions stored in the fitted model (mtry == 4)
a <- dplyr::filter(model_top20$pred, mtry == 4)
confusionMatrix(a$pred, a$obs, positive = "positive")

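This is called a learner importance filter. In some cases it can be a good choice. The way you have implemented it, however, will overestimate model performance, because you estimated feature importance on the whole training set and then used resampling to estimate performance on the feature-reduced training set.

Hi, thank you very much! I only want to know which features are more important for the training data set. Can I use it for that? Or do you think rfe is a better or more advanced choice?

Which feature selection method is better depends on the problem (no free lunch). rfe is a wrapper method for feature selection, whereas RF importance is a filter method. To compare them with no (or very little) bias, you would most likely need to perform nested cross-validation and use the filter in the right way. My first comment explains why you are not using it the right way: the selection step sees the held-out data, which causes data leakage and therefore a biased performance estimate.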
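
The leakage described in the comments can be avoided by repeating the importance ranking inside every resampling fold, so that held-out rows never influence which features are kept. What follows is a minimal sketch of that idea, not the commenter's own code. It runs on simulated data because the question's `all_data` is not available, and the names `x`, `y`, `top_n` and `fold_acc` are hypothetical stand-ins. For simplicity it reports accuracy rather than ROC and keeps `randomForest` at its default settings.

```r
library(caret)         # createFolds()
library(randomForest)  # randomForest(), importance()

set.seed(1234)

# Simulated stand-in for all_data: 200 rows, 30 numeric features,
# of which only f1 and f2 carry signal (an assumption for illustration).
n <- 200
p <- 30
x <- as.data.frame(matrix(rnorm(n * p), nrow = n, ncol = p))
names(x) <- paste0("f", seq_len(p))
y <- factor(ifelse(x$f1 + x$f2 + rnorm(n) > 0, "positive", "negative"))

top_n <- 10                      # number of features to keep (hypothetical)
folds <- createFolds(y, k = 5)   # list of held-out row indices, one per fold

fold_acc <- sapply(folds, function(test_idx) {
  x_train <- x[-test_idx, , drop = FALSE]
  y_train <- y[-test_idx]

  # Step 1: rank features using the training rows of this fold only.
  rf_all <- randomForest(x = x_train, y = y_train)
  imp <- importance(rf_all)[, "MeanDecreaseGini"]
  top <- names(sort(imp, decreasing = TRUE))[seq_len(top_n)]

  # Step 2: refit on the selected features, still on training rows only.
  rf_top <- randomForest(x = x_train[, top, drop = FALSE], y = y_train)

  # Step 3: score on rows that took no part in selection or fitting.
  pred <- predict(rf_top, newdata = x[test_idx, top, drop = FALSE])
  mean(pred == y[test_idx])
})

mean(fold_acc)  # cross-validated accuracy that includes the selection step
```

The selected features may differ from fold to fold, and that variability is part of what the estimate is meant to capture. caret's `rfe()` and `sbf()` functions are designed to run the selection step inside resampling in a similar spirit. Tuning `top_n` or `mtry` as well would need an inner resampling loop within each training fold, which is the nested cross-validation mentioned in the comments.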