Why do save and saveRDS behave differently inside dopar?
(This is my first attempt at creating a reproducible example question. Please feel free to suggest better ways to describe or illustrate the problem!)
Main issue statement
I am using foreach's %dopar% and caretList (from the caretEnsemble package) to train roughly 25,000 models in parallel. Because of R crashes and memory issues, I need to save each forecast as its own object, so my workflow looks something like this (see below for a reproducible example):
cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)
multiple.forecasts <- foreach(x=1:1,.combine='rbind',.packages=c('zoo','earth','caret',"glmnet","caretEnsemble")) %dopar% {
tryCatch({
results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitControl,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
for (i in 1:length(results)) {
results[[i]]$trainingData <- c() ## should be trimming out trainingData
}
save(results,file="foreach_results.Rdata") ## export each caretList as its own object
1
},
error = function(e) {
write.csv(e$message,file="foreach_failure.txt") ## monitor failures as needed
0
}
)
}
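Loaded back into R, the object is roughly the same size, yet the saved file is about 156 KB in Windows. So what is adding to the size of the saved object in Windows?
In the actual workflow, the smaller non-foreach objects average about 4 MB and the larger foreach objects average about 10 MB, so this creates a real storage problem when I am saving roughly 25,000 files.
- Why are objects saved inside the foreach loop so much larger, and what, if anything, can I do about it?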
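- My hypothesis is that save inside foreach saves the whole environment: rather than saving only the object, even when the object is saved with the saveRDS command (see below), the environment exported to each worker is implicitly saved as well.
- Trim does not seem to work within caretList: the trainControl trim option does not seem to trim what it should, since I had to add commands manually to trim out trainingData.
- My current workaround is to set the save compression to xz: I need the foreach loop to take advantage of multiple cores, so I am stuck with the larger objects. However, this slows the workflow down by 3-4x, which is why I am looking for a solution.
- PSOCK clusters are needed to work around a problem in caret's parallelization: see the corresponding answer.
- saveRDS does not help with this issue: I have tested with saveRDS instead of save, but the difference in object size is still there.
- Removing tryCatch does not help with the issue: even without tryCatch in the foreach loop, the difference in object size persists.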
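The first hypothesis above can be illustrated in base R without foreach or caret. The sketch below is only an illustration of a general R behaviour, not a confirmed diagnosis of this case: a formula (or closure) created inside a function records that function's frame as its environment, object.size() does not follow environments, and serialization writes the captured frame along with the object. The helper make_toy_model and the variable hidden are hypothetical names introduced for this example.

```r
# Hypothetical helper: builds a tiny "model" inside a function frame that
# also holds n_hidden doubles the model never uses directly.
make_toy_model <- function(n_hidden) {
  hidden <- rnorm(n_hidden)   # stays in this call's frame
  list(formula = y ~ x)       # the formula records this frame as its environment
}

small <- make_toy_model(10)
large <- make_toy_model(1e5)

# object.size() does not follow environments, so both look alike in memory.
mem_small <- as.numeric(object.size(small))
mem_large <- as.numeric(object.size(large))

# serialize() writes the captured frame too; save() and saveRDS() store
# objects in the same serialization format.
ser_small <- length(serialize(small, NULL))
ser_large <- length(serialize(large, NULL))

print(c(mem_small = mem_small, mem_large = mem_large))
print(c(ser_small = ser_small, ser_large = ser_large))
```

If something similar happens on a PSOCK worker, the in-memory sizes would match while the files differ, which is the pattern described in the question.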
library(caret)
library(caretEnsemble)
## train a caretList without foreach loop
fitControl <- trainControl(## 10-fold CV
method = "repeatedcv",
number = 10,
## repeated ten times
repeats = 10,
trim=TRUE)
results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitControl,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
for (i in 1:length(results)) {
results[[i]]$trainingData <- c()
}
object.size(results) ##returns about 546536 bytes
save(results,file="no_foreach_results.Rdata") ##in Windows, this object is about 136 KB
## train a caretList with foreach loop
library(doParallel)
cl <- makePSOCKcluster(4)
clusterEvalQ(cl, library(foreach))
registerDoParallel(cl)
multiple.forecasts <- foreach(x=1:1,.combine='rbind',.packages=c('zoo','earth','caret',"glmnet","caretEnsemble")) %dopar% {
tryCatch({
results <- caretList(mpg ~ cyl,data=mtcars,trControl=fitControl,methodList=c("glmnet","lm","earth"),continue_on_fail = TRUE)
for (i in 1:length(results)) {
results[[i]]$trainingData <- c()
}
save(results,file="foreach_results.Rdata") ## in Windows, this object is about 160 KB
## loading this file back in and running object.size gives about 546504 bytes, approximately the same
1
},
error = function(e) {
write.csv(e$message,file="foreach_failure.txt")
0
}
)
}
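To compare on-disk sizes from within R rather than from the Windows file browser, file.size() can be applied to the saved files. The following is a minimal sketch on a toy object (the object toy and the temporary paths are assumptions made for illustration, not part of the original workflow); it writes the same object with save at its default compression, with save using compress = "xz", and with saveRDS, then reports the three file sizes.

```r
# Toy object standing in for a trimmed caretList (illustrative only).
toy <- list(a = rep(seq_len(100), 500), b = rep(letters, 1000))

# Temporary files so the sketch does not touch the working directory.
paths <- c(
  save_default = tempfile(fileext = ".RData"),
  save_xz      = tempfile(fileext = ".RData"),
  rds          = tempfile(fileext = ".rds")
)

save(toy, file = paths[["save_default"]])              # default compression
save(toy, file = paths[["save_xz"]], compress = "xz")  # slower, usually smaller
saveRDS(toy, file = paths[["rds"]])

# file.size() returns sizes in bytes; re-attach the labels for readability.
sizes <- file.size(paths)
names(sizes) <- names(paths)
print(sizes)

# Confirm that the xz-compressed file round-trips to an identical object.
restored <- new.env()
load(paths[["save_xz"]], envir = restored)
print(identical(restored$toy, toy))
```

Running the same comparison on foreach_results.Rdata and no_foreach_results.Rdata would give the sizes quoted in the comments above in bytes.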
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] doParallel_1.0.10 iterators_1.0.8 earth_4.4.4 plotmo_3.1.4 TeachingDemos_2.10
[6] plotrix_3.6-2 glmnet_2.0-5 foreach_1.4.3 Matrix_1.2-4 caretEnsemble_2.0.0
[11] caret_6.0-64 ggplot2_2.1.0 RevoUtilsMath_8.0.1 RevoUtils_8.0.1 RevoMods_8.0.1
[16] RevoScaleR_8.0.1 lattice_0.20-33 rpart_4.1-10
loaded via a namespace (and not attached):
[1] Rcpp_0.12.4 compiler_3.2.2 nloptr_1.0.4 plyr_1.8.3 tools_3.2.2
[6] lme4_1.1-11 digest_0.6.9 nlme_3.1-126 gtable_0.2.0 mgcv_1.8-12
[11] SparseM_1.7 gridExtra_2.2.1 stringr_1.0.0 MatrixModels_0.4-1 stats4_3.2.2
[16] grid_3.2.2 nnet_7.3-12 data.table_1.9.6 pbapply_1.2-1 minqa_1.2.4
[21] reshape2_1.4.1 car_2.1-2 magrittr_1.5 scales_0.4.0 codetools_0.2-14
[26] MASS_7.3-45 splines_3.2.2 pbkrtest_0.4-6 colorspace_1.2-6 quantreg_5.21
[31] stringi_1.0-1 munsell_0.4.3 chron_2.3-47
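I don't know why either, but the workaround I came up with is to run rm(trainingData) to remove any heavy stored datasets (for example, the training dataset) from the environment, which stops them from being saved to disk.
(Glad it's not just me going crazy.)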
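The effect of that workaround can be sketched with a plain lm fit. This assumes the extra bytes come from the environment captured by the model's formula, which may or may not be what happens inside caretList on a worker; fit_and_measure is a hypothetical helper written for this example.

```r
# Hypothetical helper: fits a small model inside a function frame that also
# holds a large data frame, optionally removes that data frame, and returns
# the number of bytes needed to serialize the fitted model.
fit_and_measure <- function(cleanup) {
  trainingData <- data.frame(x = rnorm(1e5), y = rnorm(1e5))
  fit <- lm(y ~ x, data = trainingData[1:50, ])  # formula captures this frame
  if (cleanup) {
    rm(trainingData)  # drop the heavy object before the model is written out
  }
  length(serialize(fit, NULL))
}

size_kept    <- fit_and_measure(cleanup = FALSE)
size_removed <- fit_and_measure(cleanup = TRUE)

print(c(size_kept = size_kept, size_removed = size_removed))
```

In this toy setting the model serialized after rm() is far smaller, because the large data frame is no longer in the frame that the formula points to.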