在执行中停止dplyr/tidyr链并保存计算进度_R_Dplyr_Tidyr_Tidyverse

在执行中停止dplyr/tidyr链并保存计算进度

在执行中停止dplyr/tidyr链并保存计算进度,r,dplyr,tidyr,tidyverse,R,Dplyr,Tidyr,Tidyverse,我有一个编写的自定义函数，在大型数据集上运行需要一段时间，有时会出现停顿。我的函数是一个窗口函数（例如，cumsum）。如果我停止执行，所有数据都会丢失。在tidyr和dplyr中是否有办法在数据传输过程中保存数据以避免这种情况我的数据是宽幅格式的，我在多个组（如产品）和多个变量（如指标）上运行函数如果我使用tidy方法，我只需收集数据，然后按分组即可。这是可行的，但我不能在中途停止执行而不失去所有进度 # The tidy way dt2 <- dt %>% gather(

我有一个编写的自定义函数，在大型数据集上运行需要一段时间，有时会出现停顿。我的函数是一个窗口函数（例如，

cumsum

）。如果我停止执行，所有数据都会丢失。在

tidyr

和

dplyr

中是否有办法在数据传输过程中保存数据以避免这种情况

我的数据是宽幅格式的，我在多个组（如产品）和多个变量（如指标）上运行函数

如果我使用tidy方法，我只需

收集数据，然后按分组即可。这是可行的，但我不能在中途停止执行而不失去所有进度
# The tidy way
dt2 <- dt %>%
  gather(Metric,Value,3:6) %>%
  group_by(Product,Metric) %>%
  mutate(Metric2 = paste0(Metric,2),
         Value2 = cumsum(Value)) %>%
  ungroup() %>%
  select(-Value, -Metric) %>% # I would love to leave the original metric in if possible
  spread(Metric2,Value2)

整洁的方式
dt2%
聚集（度量，值，3:6）%>%
分组单位（产品、公制）%>%
突变（Metric2=paste0（Metric，2），
Value2=总和（值））%>%
解组（）%>%
选择（-Value，-Metric）%>%#如果可能，我希望保留原始度量
排列（度量2，值2）

如果我不使用tidy方法，我可以随时停止执行，并保存到该点的结果
# The non-tidy way
dt2 <- tibble()
#pb = txtProgressBar(min = 0, max = 4, initial = 0, style = 3)
for(i in 1:4) {
  single_product <- dt[which(dt$Product == unique(dt$Product)[i]),]
  for(j in 3:6) {
    single_metric <- single_product[,c(1:2,j)]
    single_metric[,paste0(colnames(single_metric[3]),2)] <- cumsum(single_metric[3])
    single_product <- left_join(single_product,single_metric)
  }
  dt2 <- bind_rows(dt2,single_product)
  #setTxtProgressBar(pb,i)
}

#不整洁的方式
dt2我能想到的保存进度的最简单方法是使用缓存。在下面的代码中，memoize\u-fun
使用一个函数来计算值（value\u-fun
），以及一个函数来计算该值的键（key\u-fun
）。在本例中，键是产品，值是您要为该产品计算的完整数据帧。我添加了消息以显示何时填充和使用缓存。请注意，如果do
语句花费的时间超过几秒钟，dplyr应该自动添加进度条。您应该在第一次运行时看到这一点，而re运行时是通过调用Sys.sleep
人为膨胀的
library(dplyr)
library(tidyr)
library(magrittr)
dt <- expand.grid(Product=LETTERS, Metric = letters[1:4], Year = 2012:2016)
dt$Value <- rnorm(nrow(dt))
dt <- dt %>%
  spread(Metric, Value)


my_cache <- list()
memoize_fun <- function(value_fun,  key_fun) {
    function(...) {
        key <- as.character(key_fun(...))
        message("Using key", deparse(key))
        assert_that(is.character(key))
        assert_that(length(key) == 1)
        if (! key %in% names(my_cache)) {
            message("Computing value for ", deparse(key))
            my_cache[[key]] <<- value_fun(...)
            Sys.sleep(1)
        } else {
            message("Re-using stored value for ", deparse(key))
        }
        return (my_cache[[key]])
    }
}

metrics <- colnames(dt)[3:6]

system.time({
    dt2 <- dt %>%
        group_by(Product) %>%
        do({
            value_fun <- . %>% cbind(., CumSum=transmute_all(.[metrics], cumsum))
            key_fun <- . %>% .$Product %>% .[1]
            memoize_fun(value_fun, key_fun)(.)
        })
})

## Run the same thing again to demonstrate that everything is cached
system.time({
    dt2 <- dt %>%
        group_by(Product) %>%
        do({
            value_fun <- . %>% cbind(., CumSum=transmute_all(.[metrics], cumsum))
            key_fun <- . %>% .$Product %>% .[1]
            memoize_fun(value_fun, key_fun)(.)
        })
})

库（dplyr）
图书馆（tidyr）
图书馆（magrittr）
dt只是想补充一点，使用transmute_all
，mutate_at，或者类似于您所做的操作，与将表格从宽改长再宽相比，也将节省大量时间。
# The data
dt <- expand.grid(Product=LETTERS[1:4], Metric = letters[1:4], Year = 2012:2016)
dt$Value <- rnorm(nrow(dt))
dt <- dt %>%
  spread(Metric, Value)

library(dplyr)
library(tidyr)
library(magrittr)
dt <- expand.grid(Product=LETTERS, Metric = letters[1:4], Year = 2012:2016)
dt$Value <- rnorm(nrow(dt))
dt <- dt %>%
  spread(Metric, Value)


my_cache <- list()
memoize_fun <- function(value_fun,  key_fun) {
    function(...) {
        key <- as.character(key_fun(...))
        message("Using key", deparse(key))
        assert_that(is.character(key))
        assert_that(length(key) == 1)
        if (! key %in% names(my_cache)) {
            message("Computing value for ", deparse(key))
            my_cache[[key]] <<- value_fun(...)
            Sys.sleep(1)
        } else {
            message("Re-using stored value for ", deparse(key))
        }
        return (my_cache[[key]])
    }
}

metrics <- colnames(dt)[3:6]

system.time({
    dt2 <- dt %>%
        group_by(Product) %>%
        do({
            value_fun <- . %>% cbind(., CumSum=transmute_all(.[metrics], cumsum))
            key_fun <- . %>% .$Product %>% .[1]
            memoize_fun(value_fun, key_fun)(.)
        })
})

## Run the same thing again to demonstrate that everything is cached
system.time({
    dt2 <- dt %>%
        group_by(Product) %>%
        do({
            value_fun <- . %>% cbind(., CumSum=transmute_all(.[metrics], cumsum))
            key_fun <- . %>% .$Product %>% .[1]
            memoize_fun(value_fun, key_fun)(.)
        })
})

my_cache <- list() # Reset the cache
finished <- FALSE
tries <- 1
while (! finished) {
    message("Attempt number ", tries)
    tryCatch({
        dt2 <- dt %>%
            group_by(Product) %>%
            do({
                value_fun <- . %>% cbind(., CumSum=transmute_all(.[metrics], cumsum)) %T>%
                    { if (runif(1) > 0.5) stop("Random error")}
                key_fun <- . %>% .$Product %>% .[1]
                memoize_fun(value_fun, key_fun)(.)
            })
        finished <- TRUE
    },
    error=function(...) NULL)
    tries <- tries + 1
}