Nesting and rolling_origin with Tidymodels in R


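I am trying to train a random forest using rolling_origin from the Tidymodels suite. I want the folds to be exactly the months of the year. Nesting looks like it can do this, but once the data is nested, tune_grid cannot find the variables. How can I make this work? I have included a reproducible example below.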


suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidymodels))
suppressPackageStartupMessages(library(yardstick))

# Create dummy data ====================================================================================================

dates <- seq(from = as.Date("2019-01-01"), to = as.Date("2019-12-31"), by = 'day' )
l <- length(dates)

set.seed(1)
data_set <- data.frame(
  date = dates,
  v1 = rnorm(l),
  v2 = rnorm(l),
  v3 = rnorm(l),
  y = rnorm(l)
)

# Random Forest Model  =================================================================================================

model <-
  parsnip::rand_forest(
    mode = "regression",
    trees = tune()) %>%
  set_engine("ranger")

# grid specification
params <-
  dials::parameters(
    trees()
  )

# Set up grid and model workflow =======================================================================================

grid <-
  dials::grid_max_entropy(
    params,
    size = 2
  )

form <- as.formula(paste("y ~ v1 + v2 + v3"))

model_workflow <-
  workflows::workflow() %>%
  add_model(model) %>%
  add_formula(form)

# Tuning on the normal data set works ====================================================================================================

data_ro_day <- data_set %>%
  rolling_origin(
    initial = 304,
    assess = 30,
    cumulative = TRUE,
    skip = 30
  )

results <- tune_grid(
  model_workflow,
  grid = grid,
  resamples = data_ro_day,
  param_info = params,
  metrics   = metric_set(mae, mape, rmse, rsq),
  control   = control_grid(verbose = TRUE))

results %>% show_best(metric = "mape", n = 2)

# Tuning on the nested data set doesn't work =========================================================================================

data_ro_month <- data_set %>%
  mutate(year_month = format(date, "%Y-%m")) %>%
  nest(data = -year_month) %>%
  rolling_origin(
    initial = 10,
    assess = 1,
    cumulative = TRUE
  )

results <- tune_grid(
    model_workflow,
    grid = grid,
    resamples = data_ro_month,
    param_info = params,
    metrics   = metric_set(mae, mape, rmse, rsq),
    control   = control_grid(verbose = TRUE))

results$.notes
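To see why the nested version fails, it can help to look at what a single resample actually contains. The sketch below is a cut-down variant of the example above (only `v1` and `y`, which is an assumption made for brevity). It shows that after nesting, each analysis set holds the columns `year_month` and `data` rather than `v1` or `y`, so a formula such as `y ~ v1 + v2 + v3` has nothing to match against. Unnesting a split by hand recovers the daily rows, but `tune_grid` does not do this for you.

```r
library(rsample)
library(tidyr)
library(dplyr)

# Cut-down version of the question's dummy data (fewer predictors, for brevity)
dates <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")
set.seed(1)
data_set <- data.frame(date = dates, v1 = rnorm(length(dates)), y = rnorm(length(dates)))

# One row per month, with the daily rows packed into a list column called `data`
nested <- data_set %>%
  mutate(year_month = format(date, "%Y-%m")) %>%
  nest(data = -year_month)

data_ro_month <- rolling_origin(nested, initial = 10, assess = 1, cumulative = TRUE)

first_split <- data_ro_month$splits[[1]]

# The analysis set only exposes the nesting key and the list column
names(analysis(first_split))

# Unnesting by hand gives back the daily rows for the first ten months
train <- analysis(first_split) %>% unnest(data)
nrow(train)
```

This explains the error but is not a fix for tuning, because the resampling object handed to `tune_grid` still contains the nested rows.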

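I'm not entirely clear on how you want to split your data for tuning, but I'd suggest looking into some of the other rsample functions such as sliding_window(), and especially sliding_period(). They let you create an experimental design for tuning where you fit on certain months of data and then assess on another month, sliding across all the months you have available: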

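library(tidymodels)

# data_set is the data frame created in the question
month_folds <- data_set %>%
  sliding_period(
    date,
    "month",
    lookback = Inf,
    skip = 4
  )

month_folds
#>   splits id
#> 1 ...    Slice1
#> 2 ...    Slice2
#> 3 ...    Slice3
#> 4 ...    Slice4
#> 5 ...    Slice5
#> 6 ...    Slice6
#> 7 ...    Slice7
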
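I used skip = 4 here to keep only the slices that have more training data. Each slice trains on several months of data and is assessed on the next month of new data, and the resamples slide forward through the data set. Because I used lookback = Inf, each slice always includes all past data, but you can change that.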
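As a rough illustration of what changing `lookback` does, the sketch below compares the expanding window from above with a fixed window of three months (`lookback = 2`, a value picked only for illustration). The helper functions `n_analysis()` and `n_assess()` are hypothetical names introduced here, and the data is a cut-down version of the question's data set.

```r
library(rsample)

# Cut-down version of the question's dummy data
dates <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")
set.seed(1)
data_set <- data.frame(date = dates, y = rnorm(length(dates)))

# Expanding window: every slice keeps all past months
expanding <- sliding_period(data_set, date, "month", lookback = Inf, skip = 4)

# Fixed window: the current month plus the two months before it
fixed <- sliding_period(data_set, date, "month", lookback = 2)

# Hypothetical helpers: rows in the analysis and assessment set of each slice
n_analysis <- function(rset) vapply(rset$splits, function(s) nrow(analysis(s)), integer(1))
n_assess <- function(rset) vapply(rset$splits, function(s) nrow(assessment(s)), integer(1))

# Analysis sets grow from slice to slice with lookback = Inf
data.frame(id = expanding$id, analysis = n_analysis(expanding), assess = n_assess(expanding))

# With a fixed lookback, the analysis sets stay at roughly three months of days
range(n_analysis(fixed))
```

Which of the two is preferable depends on whether older months are still representative of the period you want to predict.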

Once you have set up a resampling approach that fits your domain problem, you can create a model specification and tune it:

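# Same model specification as in the question
rf_spec <- rand_forest(mode = "regression", trees = tune()) %>%
  set_engine("ranger")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(y ~ v1 + v2 + v3)

tune_grid(rf_wf, resamples = month_folds)
#> # Tuning results
#> # Sliding period resampling
#> # A tibble: 7 x 4
#>   splits id     .metrics .notes
#> 1 ...    Slice1 ...      ...
#> 2 ...    Slice2 ...      ...
#> 3 ...    Slice3 ...      ...
#> 4 ...    Slice4 ...      ...
#> 5 ...    Slice5 ...      ...
#> 6 ...    Slice6 ...      ...
#> 7 ...    Slice7 ...      ...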
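A possible next step, sketched here under a few assumptions, is to store the tuning result, look at the metrics averaged over the monthly slices, and refit the workflow with the best value of `trees`. The names `rf_res`, `best`, `final_wf` and `final_fit`, the choice of `grid = 3`, and the use of RMSE for selection are illustrative and not part of the answer above.

```r
library(tidymodels)

# Same dummy data as in the question
dates <- seq(as.Date("2019-01-01"), as.Date("2019-12-31"), by = "day")
l <- length(dates)
set.seed(1)
data_set <- data.frame(date = dates, v1 = rnorm(l), v2 = rnorm(l), v3 = rnorm(l), y = rnorm(l))

month_folds <- sliding_period(data_set, date, "month", lookback = Inf, skip = 4)

rf_spec <- rand_forest(mode = "regression", trees = tune()) %>%
  set_engine("ranger")

rf_wf <- workflow() %>%
  add_model(rf_spec) %>%
  add_formula(y ~ v1 + v2 + v3)

# Keep the result so it can be inspected afterwards (grid size chosen arbitrarily)
rf_res <- tune_grid(
  rf_wf,
  resamples = month_folds,
  grid = 3,
  metrics = metric_set(rmse, mae)
)

# Metrics averaged over the monthly slices, one row per candidate and metric
collect_metrics(rf_res)

# Pick the candidate with the lowest RMSE and refit on all of the data
best <- select_best(rf_res, metric = "rmse")
final_wf <- finalize_workflow(rf_wf, best)
final_fit <- fit(final_wf, data = data_set)
```

Since the response here is pure noise, the metrics themselves are not meaningful; the point is only the shape of the workflow.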

Created on 2020-11-15 by the reprex package (v0.3.0.9001)

Thanks Julia, I wasn't aware of sliding_period, this works great!