Tidymodels (fitting a random forest with fit_resamples()): Fold01: internal: Error: Must group by variables found in `.data`


r machine-learning


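Overview

I have built a random forest regression model. My aim is to fit the model with fit_resamples() and then tune the hyperparameters. However, I get the following error message:

Error message: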

   ! Fold01: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
   x Fold01: internal: Error: Must group by variables found in `.data`.
   * Column `mtry` is not found.

   ! Fold02: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
   x Fold02: internal: Error: Must group by variables found in `.data`.
   * Column `mtry` is not found.

   ! Fold03: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
   x Fold03: internal: Error: Must group by variables found in `.data`.
   * Column `mtry` is not found.

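I have searched online for a solution, but I could not find a question that matches my specific problem. I am not an advanced R user, and I am doing my best to work my way slowly through the Tidymodels package.

If anyone can help me with this error message, I would be very grateful.

Many thanks in advance.

R code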

   set.seed(45L)

   #Open libraries
   library(tidymodels)
   library(ranger)
   library(dplyr)
   library(future) # provides plan() and multisession, used below

   #Split this single dataset into two: a training set and a testing set
   data_split <- initial_split(FID)
   #Create data frames for the two sets:
   train_data <- training(data_split)
   test_data  <- testing(data_split)

   #Resample the training data with 10-fold cross-validation (10-fold by default)
   cv <- vfold_cv(train_data, v = 10)

   ###########################################################
   ##Produce the recipe

   rec <- recipe(Frequency ~ ., data = FID) %>%
     step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove zero-variance variables
     step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
     # replaces missing numeric observations with the median
     # (recent recipes versions name this step step_impute_median())
     step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>%
     step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables

   ##Produce the random forest model

   mod_rf <- rand_forest(
       mtry = tune(),
       trees = 1000,
       min_n = tune()
     ) %>%
     set_mode("regression") %>%
     set_engine("ranger")

   ##Workflow
   wflow_rf <- workflow() %>%
     add_model(mod_rf) %>%
     add_recipe(rec)

   ##Fit model

   plan(multisession)

   fit_rf <- fit_resamples(
     wflow_rf,
     cv,
     metrics = metric_set(rmse, rsq),
     control = control_resamples(save_pred = TRUE,
                                 extract = function(x) extract_model(x)))

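If you look at the help page for fit_resamples():

fit_resamples() computes a set of performance metrics across one or more resamples. It does not perform any tuning (see tune_grid() and tune_bayes() for that)

You most likely need to tune first and then run fit_resamples() with the parameters obtained from tuning, for example: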
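
A quick way to confirm that this is the cause is to list the arguments in the workflow that are still marked with tune(). The sketch below is only illustrative: it uses the built-in mtcars data and a bare formula as stand-ins for FID and the recipe, and check_rf and check_wflow are hypothetical names.

```r
library(tidymodels)

# Same kind of specification as in the question: two arguments left as tune()
check_rf <- rand_forest(mtry = tune(), trees = 1000, min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

# Assumption: mtcars and mpg stand in for FID and Frequency
check_wflow <- workflow() %>%
  add_model(check_rf) %>%
  add_formula(mpg ~ .)

# One row per argument that is still waiting for a value
unresolved <- tune_args(check_wflow)
unresolved$name
```

If this lists any arguments, the workflow should go through tune_grid() or tune_bayes() before it is passed to fit_resamples().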

rf_grid <- expand.grid(mtry = 2:4, min_n = c(10,15,20))

mod_rf <- rand_forest(
                      mtry = tune(),
                      trees = 1000,
                      min_n = tune()
                      ) %>%
                      set_mode("regression") %>%
                      set_engine("ranger")  

wflow_rf <- workflow() %>% 
            add_model(mod_rf) %>% 
            add_recipe(rec)

rf_res <- 
  wflow_rf %>% 
  tune_grid(
    resamples = cv,grid = rf_grid
    )

Find the best parameters:

show_best(rf_res,metric="rmse")
# A tibble: 5 x 7
   mtry min_n .metric .estimator  mean     n std_err
  <int> <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
1     4    10 rmse    standard    7.87    10   0.743
2     4    15 rmse    standard    8.27    10   0.649
3     3    10 rmse    standard    8.49    10   0.682
4     3    15 rmse    standard    8.97    10   0.620
5     4    20 rmse    standard    9.49    10   0.605

Then run it again:

mod_rf <- rand_forest(mtry = 4,trees = 1000,min_n = 10) %>%
          set_mode("regression") %>%
          set_engine("ranger")  

wflow_rf <- workflow() %>% 
            add_model(mod_rf) %>% 
            add_recipe(rec)

fit_rf<-fit_resamples(
                    wflow_rf,
                    cv,
                    metrics = metric_set(rmse, rsq),
                    control = control_resamples(save_pred = TRUE,
                    extract = function(x) extract_model(x)))
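
As an alternative to typing the winning values in by hand, the tuned values can be pulled out with select_best() and written into the workflow with finalize_workflow(). This is a minimal sketch under assumptions, not the code from the question: mtcars stands in for FID, a formula replaces the recipe, the grid and tree count are smaller, and all demo_* names are hypothetical.

```r
library(tidymodels)

set.seed(45L)

# Assumption: mtcars and mpg stand in for FID and Frequency
demo_cv <- vfold_cv(mtcars, v = 5)

demo_rf <- rand_forest(mtry = tune(), trees = 200, min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

demo_wflow <- workflow() %>%
  add_model(demo_rf) %>%
  add_formula(mpg ~ .)

# Small illustrative grid; the values are arbitrary
demo_grid <- expand.grid(mtry = 2:4, min_n = c(5, 10))

# Step 1: tune over the grid
demo_res <- tune_grid(demo_wflow, resamples = demo_cv, grid = demo_grid,
                      metrics = metric_set(rmse))

# Step 2: take the combination with the lowest mean rmse
best_params <- select_best(demo_res, metric = "rmse")

# Step 3: replace the tune() placeholders with those values
final_wflow <- finalize_workflow(demo_wflow, best_params)

# Step 4: fit_resamples() now works, because nothing is left to tune
final_fit <- fit_resamples(final_wflow, demo_cv, metrics = metric_set(rmse))
collect_metrics(final_fit)
```

This keeps the chosen mtry and min_n in sync with the tuning results if the grid or the data change later.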

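Thank you very much for your help. I have been struggling with my problem for days and ended up confused. At another point in my code I create a plot with tree depth on the x-axis and the mean rmse and rsq values on the y-axis (two panels in one figure). Is there a way to incorporate tree_depth() back into the random forest model? Sorry to ask more questions, and I hope you don't think I am overstepping, but I have already made this plot for my other three models. I really appreciate your help. When I used collect_metrics(), my previous code did not extract the tree depth correctly.

Hi @AliceHobbs, no problem. So do you want to tune the tree depth (max.depth in ranger), or do you want to get the depth of the individual trees from the final model?

Probably the depth of the individual trees, because when I use collect_metrics() on the tuned model from tune_grid(), the plot shows all the trees with their mean rmse and rsq.

Hmmm, trying to clarify the question. I guess you are comparing with other tree-based models (e.g. rpart or gbm) that use tree_depth() as a tuning parameter. Your question is whether ranger has an option to tune this..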
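
For reference, ranger itself accepts a max.depth argument, so one rough way to see how depth affects error is to loop over a few depths outside of tidymodels and record the out-of-bag error. This is only a sketch under assumptions: it calls ranger directly rather than through the workflow above, uses mtcars in place of FID, and the depth range 1 to 6 and the names depths and depth_results are arbitrary.

```r
library(ranger)

# Assumption: mtcars and mpg stand in for FID and Frequency
depths <- 1:6

# Fit one forest per depth limit and keep its out-of-bag RMSE
oob_rmse <- sapply(depths, function(d) {
  fit <- ranger(mpg ~ ., data = mtcars, num.trees = 500,
                max.depth = d, seed = 45)
  sqrt(fit$prediction.error)  # prediction.error is the OOB MSE for regression
})

depth_results <- data.frame(max_depth = depths, oob_rmse = oob_rmse)
depth_results
```

The resulting data frame has depth in one column and error in the other, which is the shape needed for a depth-versus-rmse plot like the ones made for the other models. Note that this uses out-of-bag error rather than the cross-validated metrics that collect_metrics() reports.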