How to optimize memory in dplyr + purrr


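I have a problem: after replicating the data for my training and test sets, RStudio shows a large amount of memory allocated to my user that is not being used in my R session. I created a small example to reproduce my situation :)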

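This code runs a series of models based on the different formulas, algorithms, and parameter sets I give it. It is normally a function, but I turned it into a simple script for the reprex.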

library(dplyr)
library(purrr)
library(modelr)
library(tidyr)
library(pryr)

# set my inputs
data <- mtcars
formulas <- c(test1 = mpg ~ cyl + wt + hp,
              test2 = mpg ~ cyl + wt)
params = list()
methods <- "lm"

n <- 20 # num of cv splits
mult <- 10 # number of times I want to replicate some of the data
frac <- .25 # how much I want to cut down other data (fractional)

### the next few chunks get the unique combos of the inputs.
if (length(params) != 0) {
  cross_params <- params %>% 
    map(cross) %>% 
    map_df(enframe, name = "param_set", .id = "method") %>% 
    list
} else cross_params <- NULL

methods_df <- tibble(method = methods) %>% 
  list %>% 
  append(cross_params)  %>% 
  reduce(left_join, by = "method") %>% 
  split(1:nrow(.))

# wrangle formulas into a split dataframe
formulas_df <- tibble(formula = formulas,
                      name = names(formulas)) %>% 
  split(.$name)

# split out the data into n random train-test combos
cv_data <- data %>% 
  crossv_kfold(n) %>% # rsample?
  mutate_at(vars(train:test), ~map(.x, as_tibble))

# sample out if needed
cv_data_samp <- cv_data %>%
  mutate(train = modify(train, 
                        ~ .x %>% 
                          split(.$gear == 4) %>% 
                          # take a sample of the non-vo data
                          modify_at("FALSE", sample_frac, frac) %>% 
                          # multiply out the vo-on data
                          modify_at("TRUE", function(.df) {
                            map_df(seq_len(mult), ~ .df)
                          }) %>% 
                          bind_rows))

# get all unique combos of formula and method
model_combos <- list(cv = list(cv_data_samp), 
                     form = formulas_df, 
                     meth = methods_df) %>% 
  cross %>%
  map_df(~ bind_cols(nest(.x$cv), .x$form, .x$meth)) %>% 
  unnest(data, .preserve = matches("formula|param|value")) %>% 
  {if ("value" %in% names(.)) . else mutate(., value = list(NULL))} 

# run the models
model_combos %>% 
  # put all arguments into a single params column
  mutate(params = pmap(list(formula = formula, data = train), list)) %>%
  mutate(params = map2(params, value, ~ append(.x, .y))) %>%
  mutate(params = modify(params, discard, is.null)) %>%
  # run the models
  mutate(model = invoke_map(method, params))  

mem_change(rm(data, cv_data, cv_data_samp))
mem_used()


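Now, after I do this, my `mem_used()` shows 77.3 MB, but the memory allocated to my R user (160 MB) is roughly twice that. This really blows up when my data is 3 GB, which is my real case: I end up using 100 GB and taking up the whole server :(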
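One way to look into a gap like this is to compare what R currently holds with the peak it reached while the pipeline ran. The sketch below is illustrative only: the `tmp` list is a hypothetical stand-in for the intermediate copies a real pipeline creates, and it assumes the extra memory comes from such temporaries, which may not be the whole story. It uses only base R's `gc()`.

```r
# Sketch: compare the current vector heap with its peak since the last reset.
invisible(gc(reset = TRUE))          # reset the "max used" statistics

# Hypothetical stand-in for the pipeline: a large temporary, a small result
tmp <- lapply(1:20, function(i) rnorm(1e5))   # about 2e6 doubles
result <- vapply(tmp, mean, numeric(1))
rm(tmp)

mem <- gc()                          # collect garbage, then report usage

# Vcells are 8-byte cells on the vector heap, where data values live
to_mb <- function(cells) cells * 8 / 1024^2
current_mb <- to_mb(mem["Vcells", "used"])
peak_mb    <- to_mb(mem["Vcells", "max used"])

cat(sprintf("current: %.1f MB, peak: %.1f MB\n", current_mb, peak_mb))
```

If the peak sits far above the current figure, the process may have grown to hold the temporaries, and the operating system can keep reporting that larger size for a while even after R has freed them.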

What is going on, and how can I optimize this?

Any help is appreciated!!!

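I figured it out! The problem was that I was converting my series of `modelr` `resample` objects into `tibble`s, which blows up memory even though I sample them afterwards. The solution is to write methods that handle `resample` objects directly, so that I never have to convert the `resample` objects into `tibble`s. These look like: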

# Sample the row indexes instead of the data itself.
sample_frac.resample <- function(data, frac) {
  n <- length(data$idx)
  # Sample positions so that a length-one `idx` is not expanded to 1:idx
  data$idx <- data$idx[sample.int(n, size = floor(frac * n))]
  data
}

# Replicate the row indexes n times. I should probably call it something else.
augment.resample <- function(data, n) {
  data$idx <- rep(data$idx, times = n)
  data
}

# Simple (logical-only) splitting of resample objects.
# `.p` is a logical vector with one entry per row of `data$data`.
split.resample <- function(data, .p) {
  keep <- .p[data$idx]
  # Subset the existing index so rows outside this resample are not pulled in
  pos <- list(data = data$data, idx = data$idx[which(keep)])
  neg <- list(data = data$data, idx = data$idx[which(!keep)])
  class(pos) <- "resample"
  class(neg) <- "resample"
  list("TRUE" = pos,
       "FALSE" = neg)
}

# The equivalent of `bind_rows` for a list of resample objects.
# Since `bind_rows` does not call `UseMethod`, I had to call it something else.
bind <- function(data) {
  out <- list(data = data[[1]]$data, idx = unlist(map(data, pluck, "idx")))
  class(out) <- "resample"
  out
}
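To illustrate the idea behind these helpers, here is a minimal sketch that does the same sample-and-replicate step on a single training fold by editing only its integer index. It does not call the helpers above; it assumes `modelr` is installed, reuses `mult`, `frac`, and the `gear == 4` split from the question, and the variable names (`idx_gear4`, `idx_other`, `train_samp`) are hypothetical.

```r
library(modelr)

set.seed(1)
folds <- crossv_kfold(mtcars, k = 4)
train <- folds$train[[1]]        # a `resample`: list(data = ..., idx = ...)

mult <- 10    # replication factor, as in the question
frac <- 0.25  # sampling fraction, as in the question

# Work on the integer index only; no rows of `train$data` are copied here.
is_gear4  <- train$data$gear[train$idx] == 4
idx_gear4 <- train$idx[is_gear4]
idx_other <- train$idx[!is_gear4]

# Down-sample the other rows and replicate the gear-4 rows, by index
keep    <- sample.int(length(idx_other), size = floor(frac * length(idx_other)))
new_idx <- c(idx_other[keep], rep(idx_gear4, times = mult))

# Rebuild a resample object that still points at the original data frame
train_samp <- resample(train$data, new_idx)

# Rows are materialised only at the moment the model is fitted
fit <- lm(mpg ~ cyl + wt, data = as.data.frame(train_samp))
```

The replicated rows exist only as repeated integers until `as.data.frame()` is called, so the cost of holding many folds stays close to the cost of holding their indexes.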
