Tidymodels (fitting a random forest with fit_resamples()): Fold01: internal: Error: Must group by variables found in `.data`
Overview
I have built a random forest regression model, and my goal is to fit it with the function fit_resamples() and then tune the hyperparameters. However, I get the following error message:
Error message:
! Fold01: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold01: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
! Fold02: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold02: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
! Fold03: model: tune columns were requested but there were 14 predictors in the data. 14 will be u...
x Fold03: internal: Error: Must group by variables found in `.data`.
* Column `mtry` is not found.
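I searched online for a solution, but I could not find a question that matches my specific problem. I am not an advanced R user, and I am doing my best to work my way through the tidymodels packages step by step.
I would be very grateful if anyone could help me with this error message.
Many thanks.
R code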
set.seed(45L)
#Open libraries
library(tidymodels)
library(ranger)
library(dplyr)
#split this single dataset into two: a training set and a testing set
data_split <- initial_split(FID)
#Create data frames for the two sets:
train_data <- training(data_split)
test_data <- testing(data_split)
#resample the data with 10-fold cross-validation (10-fold by default)
cv <- vfold_cv(train_data, v=10)
###########################################################
##Produce the recipe
rec <- recipe(Frequency ~ ., data = FID) %>%
step_nzv(all_predictors(), freq_cut = 0, unique_cut = 0) %>% # remove variables with zero variances
step_novel(all_nominal()) %>% # prepares test data to handle previously unseen factor levels
step_medianimpute(all_numeric(), -all_outcomes(), -has_role("id vars")) %>% # replaces missing numeric observations with the median
step_dummy(all_nominal(), -has_role("id vars")) # dummy codes categorical variables
#Produce the random forest model
mod_rf <- rand_forest(
mtry = tune(),
trees = 1000,
min_n = tune()
) %>%
set_mode("regression") %>%
set_engine("ranger")
##Workflow
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
##Fit model
library(future) # provides plan()
plan(multisession)
fit_rf<-fit_resamples(
wflow_rf,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
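If you look at the help page for fit_resamples(), it says: "fit_resamples() computes a set of performance metrics across one or more resamples. It does not perform any tuning (see tune_grid() and tune_bayes() for that)". Your model specification still contains tune() placeholders for mtry and min_n, which fit_resamples() cannot resolve, hence the error. You most likely need to tune first and then run fit_resamples() with the parameters obtained from tuning, for example: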
rf_grid <- expand.grid(mtry = 2:4, min_n = c(10,15,20))
mod_rf <- rand_forest(
mtry = tune(),
trees = 1000,
min_n = tune()
) %>%
set_mode("regression") %>%
set_engine("ranger")
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
rf_res <-
wflow_rf %>%
tune_grid(
resamples = cv,grid = rf_grid
)
Find the best parameters:
show_best(rf_res,metric="rmse")
# A tibble: 5 x 7
   mtry min_n .metric .estimator  mean     n std_err
  <int> <dbl> <chr>   <chr>      <dbl> <int>   <dbl>
1     4    10 rmse    standard    7.87    10   0.743
2     4    15 rmse    standard    8.27    10   0.649
3     3    10 rmse    standard    8.49    10   0.682
4     3    15 rmse    standard    8.97    10   0.620
5     4    20 rmse    standard    9.49    10   0.605
Then run it again with the best values filled in:
mod_rf <- rand_forest(mtry = 4,trees = 1000,min_n = 10) %>%
set_mode("regression") %>%
set_engine("ranger")
wflow_rf <- workflow() %>%
add_model(mod_rf) %>%
add_recipe(rec)
fit_rf<-fit_resamples(
wflow_rf,
cv,
metrics = metric_set(rmse, rsq),
control = control_resamples(save_pred = TRUE,
extract = function(x) extract_model(x)))
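Instead of copying the best values by hand, the tuned values can be substituted into the workflow programmatically. The following is only a minimal sketch under stated assumptions: the FID data are not shown in the question, so the built-in mtcars data set stands in for them, a plain formula replaces the recipe, and the names cv_demo, wflow_demo, demo_grid and demo_res are made up for illustration.

```r
library(tidymodels)

set.seed(45L)

# Assumption: mtcars stands in for the FID data, which is not shown in the question
cv_demo <- vfold_cv(mtcars, v = 5)

# Model specification with tune() placeholders, as in the question
rf_tune_spec <- rand_forest(mtry = tune(), trees = 200, min_n = tune()) %>%
  set_mode("regression") %>%
  set_engine("ranger")

wflow_demo <- workflow() %>%
  add_model(rf_tune_spec) %>%
  add_formula(mpg ~ .)

# Small illustrative grid of candidate values
demo_grid <- expand.grid(mtry = 2:4, min_n = c(5, 10))

demo_res <- tune_grid(
  wflow_demo,
  resamples = cv_demo,
  grid = demo_grid,
  metrics = metric_set(rmse, rsq)
)

# Pick the combination with the lowest mean RMSE
best_params <- select_best(demo_res, metric = "rmse")

# Replace the tune() placeholders with the chosen values
final_wflow <- finalize_workflow(wflow_demo, best_params)

# No tune() placeholders remain, so fit_resamples() can run
final_fit <- fit_resamples(
  final_wflow,
  cv_demo,
  metrics = metric_set(rmse, rsq)
)

collect_metrics(final_fit)
```

With this pattern the final specification follows the tuning results automatically if the grid or the data change.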
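Thank you very much for your help. I had been trying to solve this for days and had become confused. At another point in my code I create a figure with tree depth on the x-axis and the mean rmse and rsq values on the y-axis (two plots in one figure). Is there a way to incorporate tree_depth() back into the random forest model? Sorry to ask more questions, and I hope you don't think I am overstepping, but I have already produced this plot for my other three models. I really appreciate your help. When I used the function collect_metrics(), my previous code did not extract the tree depth correctly.

Hi @AliceHobbs, no problem. So do you want to tune the tree depth (max.depth in ranger), or do you want to get the depth of the individual trees from the final model?

Probably the depth of the individual trees, because when I use collect_metrics() on the tuned model from tune_grid(), the plot shows all the trees with their mean rmse and rsq.

Hmmm, trying to clarify the question. I guess you are comparing with other tree-based models (e.g. rpart or gbm) that use tree_depth() as a tuning parameter. Your question is whether ranger has an option to tune this..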
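If the aim is a plot of rmse and rsq against tree depth, one possible approach is to pass ranger's max.depth argument through set_engine() and resample the model once per depth. This is a rough sketch only, and it assumes an interpretation of the question that the thread never confirms. The mtcars data again stand in for FID, the helper fit_at_depth() and the depth values are hypothetical, and mtry and min_n are simply fixed at the values used above.

```r
library(tidymodels)

set.seed(45L)

# Assumption: mtcars stands in for the FID data, which is not shown in the question
cv_demo <- vfold_cv(mtcars, v = 5)

# Hypothetical helper: resample one forest with a fixed maximum tree depth
fit_at_depth <- function(depth) {
  spec <- rand_forest(mtry = 4, trees = 200, min_n = 10) %>%
    set_mode("regression") %>%
    # max.depth is an argument of ranger::ranger(); !! inlines the value
    set_engine("ranger", max.depth = !!depth)

  workflow() %>%
    add_model(spec) %>%
    add_formula(mpg ~ .) %>%
    fit_resamples(cv_demo, metrics = metric_set(rmse, rsq)) %>%
    collect_metrics() %>%
    mutate(max_depth = depth)
}

# One row per depth and metric
depth_metrics <- bind_rows(lapply(c(1, 2, 4, 8), fit_at_depth))
depth_metrics
```

The resulting depth_metrics tibble has one row per depth and metric, so it can be plotted with max_depth on the x-axis and mean on the y-axis, with one panel per .metric.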