R - How to nest a data frame by group on parallel cores

Tags: r, parallel-processing, dplyr, tidyr

Any idea how to run the following operation on parallel cores?

Libraries and sample data

Any help would be appreciated.

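Simply nesting a data frame across multiple cores will not be efficient, so I assume you want to do some other computation as well. The example below computes a summary for each group id, where every id has many values.

The multidplyr package makes this sort of thing convenient.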

# replace plyr with multidplyr
libs <- c("dplyr", "tidyr", "multidplyr")
devtools::install_github("hadley/multidplyr")
sapply(libs, require, character.only = TRUE)

set.seed(1)
df <- data.frame(id = sample(1:10, 100000, TRUE),
                 value = runif(100000)) %>% as.tbl

# First the single-core solution. No need to nest,
# since group_by() %>% do() automatically nests.
x <- df %>%
  group_by(id) %>%
  # nest() %>%
  do(stat_summary = summary(.$value) %>% as.matrix %>% t %>% data.frame %>% as.tbl) %>%
  ungroup

# Next, the multi-core solution
n_cores <- 2
cl <- multidplyr::create_cluster(n_cores)
# the packages have to be loaded on every node of the cluster
cluster_library(cl, c("dplyr", "tidyr"))
df_mp <- df %>% multidplyr::partition(cluster = cl, id) # group by id

# Same summary as above, computed on the workers and then collected
x_mp <- df_mp %>%
  do(stat_summary = summary(.$value) %>% as.matrix %>% t %>% data.frame %>% as.tbl) %>%
  collect() %>%
  ungroup
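The results match. Unless the computation itself takes longer than loading the data into each separate process, you probably won't gain much speed.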

all.equal(unnest(x_mp),unnest(x))
x_mp

TRUE
# A tibble: 10 x 2
      id     stat_summary
   <int>           <list>
 1     3 <tibble [1 x 6]>
 2     5 <tibble [1 x 6]>
 3     6 <tibble [1 x 6]>
 4     7 <tibble [1 x 6]>
 5     1 <tibble [1 x 6]>
 6     2 <tibble [1 x 6]>
 7     4 <tibble [1 x 6]>
 8     8 <tibble [1 x 6]>
 9     9 <tibble [1 x 6]>
10    10 <tibble [1 x 6]>
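
To get a feel for the overhead mentioned above, here is a minimal sketch that swaps multidplyr for the base `parallel` package (a different technique from the answer's code, chosen only because it ships with R). The helper `summarise_group` and the variable names are illustrative assumptions, not part of the original answer. It runs the same cheap per-group computation serially and on two worker processes and compares the elapsed times; no timings are quoted here because they depend on the machine.

```r
library(parallel)  # ships with base R

set.seed(1)
df <- data.frame(id = sample(1:10, 100000, TRUE),
                 value = runif(100000))

# One numeric vector of values per group id
groups <- split(df$value, df$id)

# Hypothetical per-group work: quantiles plus the mean.
# This is cheap, so data transfer is likely to dominate.
summarise_group <- function(v) c(quantile(v), mean = mean(v))

# Single-core baseline
t_serial <- system.time(
  res_serial <- lapply(groups, summarise_group)
)

# Two worker processes; each group's data is shipped to a worker
cl <- makeCluster(2)
t_parallel <- system.time(
  res_parallel <- parLapply(cl, groups, summarise_group)
)
stopCluster(cl)

# Both approaches should give the same per-group results
same_result <- isTRUE(all.equal(res_serial, res_parallel))
print(same_result)

# Elapsed seconds for each approach (machine dependent)
print(c(serial = t_serial[["elapsed"]], parallel = t_parallel[["elapsed"]]))
```

If the parallel run is not faster here, that is consistent with the caveat: the per-group work has to be expensive enough to outweigh the cost of starting workers and sending them the data.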

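Do you want each group to run on a different core? Besides nest, do you want to perform other operations on the data, such as sum, summary, etc.?

Thanks @dule aranux. I didn't know about the multidplyr package; it can solve my problem.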