Why do save and saveRDS behave differently inside %dopar%?

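(This is my first attempt at a reproducible example question. Please feel free to comment with better ways to describe or illustrate the problem!)

Main issue statement

I am using foreach's %dopar% with caretList (from the caretEnsemble package) to train about 25,000 models in parallel. Because of R crashes and memory issues, I need to save each forecast as a separate object, so my workflow looks roughly like this (see below for a reproducible example):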

cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)

multiple.forecasts <- foreach(x=1:1,.combine='rbind',.packages=c('zoo','earth','caret',"glmnet","caretEnsemble")) %dopar% {
  tryCatch({
    results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitControl,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
    for (i in 1:length(results)) {
      results[[i]]$trainingData <- c() ## should be trimming out trainingData
    }
    save(results,file="foreach_results.Rdata") ## export each caretList as its own object
    1
  },
  error = function(e) {
    write.csv(e$message,file="foreach_failure.txt") ## monitor failures as needed
    0
  }
  )
}
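The object is approximately the same, at about 156 KB in Windows. So what is adding to the size of the saved object?

In the actual workflow, the smaller non-foreach objects average about 4 MB and the larger foreach objects average about 10 MB, so this creates real storage problems when I am saving about 25,000 files.

  • Why are objects saved inside the foreach loop so much larger, and what, if anything, can I do about it?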
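Notes

  • My hypothesis is that save inside foreach saves the whole environment: rather than saving only the object, even when it is saved with saveRDS (see below), the environment exported to each worker is implicitly saved as well.
  • trim does not appear to work within caretList: the trim option of trainControl does not seem to trim what it is supposed to, since I had to add commands manually to trim trainingData.
  • My current workaround is to set the compression of save to xz: I need the foreach loop to make use of multiple cores, so I am stuck with the larger objects. However, this slows the workflow down by a factor of 3-4, which is why I am looking for a solution.
  • A PSOCK cluster is needed to work around a problem in caret's parallelization: see the answer.
  • saveRDS does not help with this problem: I have tested with saveRDS instead of save, but the difference in object sizes persists.
  • Removing tryCatch does not help: even without tryCatch in the foreach loop, the difference in object sizes persists.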
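The hypothesis in the first note can be probed in base R without caret or a cluster. The sketch below is only an illustration under stated assumptions: `make_local_formula` and the vector `big` are hypothetical stand-ins, and a formula created inside a function is used to mimic a model fitted inside a worker's evaluation environment. It relies on two documented behaviours: `object.size()` does not include the environment associated with an object, while serialization writes out any non-global environment an object refers to.

```r
# Hypothetical helper: builds a formula inside a function frame that also
# holds a large vector, mimicking a model fitted inside a worker's environment.
make_local_formula <- function() {
  big <- rnorm(1e5)   # stand-in for data sitting next to the model
  mpg ~ cyl           # this formula's environment is the function frame
}

f_local  <- make_local_formula()
f_global <- mpg ~ cyl
# The global environment is serialized as a reference, not by content.
environment(f_global) <- globalenv()

# object.size() does not follow environments, so it cannot see the difference.
mem_local  <- as.numeric(object.size(f_local))
mem_global <- as.numeric(object.size(f_global))

# serialize() does follow them, so the hidden vector shows up in the byte count.
bytes_local  <- length(serialize(f_local, NULL))
bytes_global <- length(serialize(f_global, NULL))

cat("object.size:", mem_local, "vs", mem_global, "\n")
cat("serialized: ", bytes_local, "vs", bytes_global, "\n")
```

If the same mechanism applies to the models trained inside %dopar%, it would be consistent with object.size reporting nearly identical values while the files on disk differ. Whether that is what actually happens with caretList has not been verified here.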
Technical details

Reproducible example:

library(caret)
library(caretEnsemble)

## train a caretList without foreach loop
fitControl <- trainControl(## 10-fold CV
  method = "repeatedcv",
  number = 10,
  ## repeated ten times
  repeats = 10,
  trim=TRUE)

results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitControl,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
for (i in 1:length(results)) {
    results[[i]]$trainingData <- c()
}
object.size(results) ##returns about 546536 bytes
save(results,file="no_foreach_results.Rdata") ##in Windows, this object is about 136 KB

## train a caretList with foreach loop
library(doParallel)

cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)

multiple.forecasts <- foreach(x=1:1,.combine='rbind',.packages=c('zoo','earth','caret',"glmnet","caretEnsemble")) %dopar% {
  tryCatch({
    results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitControl,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
    for (i in 1:length(results)) {
      results[[i]]$trainingData <- c()
    }
    save(results,file="foreach_results.Rdata") ## in Windows, this object is about 160 KB
    ## loading this file back in and running object.size gives about 546504 bytes, approximately the same
    1
  },
  error = function(e) {
    write.csv(e$message,file="foreach_failure.txt")
    0
  }
  )
}
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] doParallel_1.0.10   iterators_1.0.8     earth_4.4.4         plotmo_3.1.4        TeachingDemos_2.10 
 [6] plotrix_3.6-2       glmnet_2.0-5        foreach_1.4.3       Matrix_1.2-4        caretEnsemble_2.0.0
[11] caret_6.0-64        ggplot2_2.1.0       RevoUtilsMath_8.0.1 RevoUtils_8.0.1     RevoMods_8.0.1     
[16] RevoScaleR_8.0.1    lattice_0.20-33     rpart_4.1-10       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.4        compiler_3.2.2     nloptr_1.0.4       plyr_1.8.3         tools_3.2.2       
 [6] lme4_1.1-11        digest_0.6.9       nlme_3.1-126       gtable_0.2.0       mgcv_1.8-12       
[11] SparseM_1.7        gridExtra_2.2.1    stringr_1.0.0      MatrixModels_0.4-1 stats4_3.2.2      
[16] grid_3.2.2         nnet_7.3-12        data.table_1.9.6   pbapply_1.2-1      minqa_1.2.4       
[21] reshape2_1.4.1     car_2.1-2          magrittr_1.5       scales_0.4.0       codetools_0.2-14  
[26] MASS_7.3-45        splines_3.2.2      pbkrtest_0.4-6     colorspace_1.2-6   quantreg_5.21     
[31] stringi_1.0-1      munsell_0.4.3      chron_2.3-47  

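I don't know why either, but the workaround I came up with is to run

rm(trainingData)

to remove any heavy stored data sets (for example, the training data set) from the environment, which stops them from being saved to disk.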
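Why removing objects from the environment can shrink the saved file is easier to see in a small base R sketch. Everything in it is an assumption made for illustration: `fit_and_serialize`, `training_data` and `model_formula` are hypothetical names, and a plain formula stands in for a fitted model that keeps a reference to the environment it was created in.

```r
# Hypothetical helper: creates a large object and a formula in the same frame,
# optionally removes the large object, then measures the serialized size.
fit_and_serialize <- function(cleanup) {
  training_data <- rnorm(1e5)      # heavy object living next to the "model"
  model_formula <- mpg ~ cyl       # keeps a reference to this function frame
  if (cleanup) rm(training_data)   # drop the heavy object before serializing
  length(serialize(model_formula, NULL))
}

bytes_without_rm <- fit_and_serialize(cleanup = FALSE)
bytes_with_rm    <- fit_and_serialize(cleanup = TRUE)

cat("without rm():", bytes_without_rm, "bytes\n")
cat("with rm():   ", bytes_with_rm, "bytes\n")
```

Inside a %dopar% body the same idea would mean calling rm() on large intermediates before save or saveRDS. This only helps for objects that live in an environment the saved object refers to.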

(Glad it's not just me going crazy.)
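Since the cause is not pinned down above, one way to narrow it is to measure the serialized size of each component of a saved object and see which one carries the extra bytes. The sketch below is a generic diagnostic, not something taken from caret or caretEnsemble: `serialized_bytes`, `component_bytes` and the toy model are hypothetical names, and the toy list only imitates a fitted model whose terms component points at a frame holding data.

```r
# Number of bytes an object occupies once serialized (uncompressed).
serialized_bytes <- function(x) length(serialize(x, NULL))

# Serialized size of every top-level component of a list, largest first.
component_bytes <- function(obj) {
  sizes <- vapply(obj, serialized_bytes, numeric(1))
  sort(sizes, decreasing = TRUE)
}

# Toy stand-in for a fitted model: the formula drags its creating frame along.
make_toy_model <- function() {
  training_data <- rnorm(1e5)   # hidden payload in the frame
  list(coefficients = c(a = 1, b = 2), terms = mpg ~ cyl)
}

toy   <- make_toy_model()
sizes <- component_bytes(toy)
print(sizes)
```

Applied to one element of the saved results list, a call such as component_bytes(results[[1]]) would show which components are large once serialized. Because an environment shared by several components is counted once per component here, the sizes can add up to more than the file itself.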
