R: Combining split-apply with a function that returns multiple variables
Tags: r, dplyr, plyr

I need to apply myfun to subsets of a data frame and include the results as new columns in the returned data frame. In the past I used ddply. In dplyr, I believe summarise is used for this, as follows:
myfun <- function(x, y) {
  df <- data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
  return(df)
}

mtcars %>%
  group_by(cyl) %>%
  summarise(a = myfun(cyl, disp)$a, b = myfun(cyl, disp)$b)
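The code above works, but the myfun I will actually use is computationally very expensive, so I would like to call it only once per group rather than separately for columns a and b. Is there a way to do this in dplyr?

Since your function returns a data frame, you can call it within each group via %>% do, which applies the function to each individual group and rbinds the returned data frames together: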
mtcars %>% group_by(cyl) %>% do(myfun(.$cyl, .$disp))
# A tibble: 3 x 3
# Groups: cyl [3]
# cyl a b
# <dbl> <dbl> <dbl>
#1 4 420.5455 -101.1364
#2 6 1099.8857 -177.3143
#3 8 2824.8000 -345.1000
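As a possible alternative to do, here is a minimal sketch that relies on a behaviour available in dplyr 1.0.0 and later: when an unnamed expression inside summarise returns a data frame, its columns are unpacked into the result. The calls counter is a hypothetical addition, included only to make it visible how often myfun runs; it is not part of the original question.

```r
library(dplyr)

calls <- 0  # hypothetical counter, only used to observe how often myfun runs

myfun <- function(x, y) {
  calls <<- calls + 1
  data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
}

# Assumes dplyr >= 1.0.0: the unnamed data-frame result is unpacked
# into the columns a and b, so myfun appears only once in the call.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(myfun(cyl, disp))

res
calls  # inspect to check that myfun ran once per group
```

Because myfun appears only once in the summarise call, an expensive function should not be evaluated separately for a and b.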
We can use data.table:
library(data.table)
setDT(mtcars)[, myfun(cyl, disp), cyl]
# cyl a b
#1: 6 1099.8857 -177.3143
#2: 4 420.5455 -101.1364
#3: 8 2824.8000 -345.1000
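Note that setDT converts its argument to a data.table by reference. If you would rather leave mtcars untouched, a variant of the same call can work on a copy made with as.data.table. This is a sketch under that assumption; the names dt and res_dt are illustrative.

```r
library(data.table)

myfun <- function(x, y) {
  data.frame(a = mean(x) * mean(y), b = mean(x) - mean(y))
}

# as.data.table() returns a copy, so the original mtcars stays a data.frame
dt <- as.data.table(mtcars)

# myfun is evaluated once per group; the columns of the returned
# data frame (a and b) become columns of the result
res_dt <- dt[, myfun(cyl, disp), by = cyl]
res_dt
```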
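do does not necessarily improve speed. In this post I introduce another way to design a function that performs the same task, and then run a benchmark to compare the performance of each method.

Here is another way to define the function: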
myfun2 <- function(dt, x, y){
  x <- enquo(x)
  y <- enquo(y)
  dt2 <- dt %>%
    summarise(a = mean(!!x) * mean(!!y), b = mean(!!x) - mean(!!y))
  return(dt2)
}
This way, we do not have to call myfun once for every new column we create, so this approach may be more efficient than myfun.
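For readers on a recent tidyverse, the enquo and !! pair can also be written with the embrace operator {{ }}, which is available in rlang 0.4.0 and later. The sketch below is an equivalent formulation under that assumption; the name myfun3 is hypothetical and does not appear in the benchmark.

```r
library(dplyr)

# Same idea as myfun2, using {{ }} instead of enquo() and !!
myfun3 <- function(dt, x, y) {
  dt %>%
    summarise(
      a = mean({{ x }}) * mean({{ y }}),
      b = mean({{ x }}) - mean({{ y }})
    )
}

res3 <- mtcars %>%
  group_by(cyl) %>%
  myfun3(x = cyl, y = disp)

res3
```

Like myfun2, this only helps when the work can be expressed directly inside summarise; the means are still computed separately for a and b.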
Below is a performance comparison using microbenchmark. The methods I compared are listed below; each was run 1000 times.
m1: OP's original way to apply `myfun`
m2: Psidom's method, using `do` to apply `myfun`.
m3: My approach, using `myfun2`
m4: Using `do` to apply `myfun2`
m5: Z.Lin's suggestion, directly calculating the values without defining a function.
m6: akrun's `data.table` approach with `myfun`
Here is the code for the benchmark:

library(microbenchmark)

microbenchmark(m1 = (mtcars %>%
                       group_by(cyl) %>%
                       summarise(a = myfun(cyl, disp)$a, b = myfun(cyl, disp)$b)),
               m2 = (mtcars %>%
                       group_by(cyl) %>%
                       do(myfun(.$cyl, .$disp))),
               m3 = (mtcars %>%
                       group_by(cyl) %>%
                       myfun2(x = cyl, y = disp)),
               m4 = (mtcars %>%
                       group_by(cyl) %>%
                       do(myfun2(., x = cyl, y = disp))),
               m5 = (mtcars %>%
                       group_by(cyl) %>%
                       summarise(a = mean(cyl) * mean(disp), b = mean(cyl) - mean(disp))),
               m6 = (as.data.table(mtcars)[, myfun(cyl, disp), cyl]),
               times = 1000)

Here are the benchmark results:

Unit: milliseconds
 expr       min        lq      mean    median        uq        max neval
   m1  7.058227  7.692654  9.429765  8.375190 10.570663  28.730059  1000
   m2  8.559296  9.381996 11.643645 10.500100 13.229285  27.585654  1000
   m3  6.817031  7.445683  9.423832  8.085241 10.415104 193.878337  1000
   m4 21.787298 23.995279 28.920262 26.922683 31.673820 177.004151  1000
   m5  5.337132  5.785528  7.120589  6.223339  7.810686  23.231274  1000
   m6  1.320812  1.540199  1.919222  1.640270  1.935352   7.622732  1000
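The results show that the do methods (m2 and m4) are actually slower than their counterparts (m1 and m3). In this case, applying myfun (m1) and myfun2 (m3) directly is faster than using do, and myfun2 (m3) is slightly faster than myfun (m1). However, not defining any function at all (m5) is faster than all of the function-based methods (m1 to m4), which suggests that for this particular case there is no need to define a function. Finally, if you do not need to stay in the tidyverse, or if the dataset is very large, consider the data.table approach (m6), which is much faster than all the tidyverse solutions listed here.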
If myfun returns multiple rows and columns, use do(nested_column = myfun(3, .$disp)) %>% tidyr::unnest().

Is myfun the actual function you intend to use? In this case I don't see the point of creating a data frame inside the function, since you only want to return 2 numbers. mtcars %>% group_by(cyl) %>% summarise(a = mean(cyl) * mean(disp), b = mean(cyl) - mean(disp)) should return the same result, unless I'm missing something.

The myfun I will use is computationally very expensive.
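The multi-row case mentioned in the comments can also be sketched with a list column inside summarise followed by tidyr::unnest. This is only an illustration: myfun_multi is a hypothetical function invented here to return two rows per group, and the column name nested is arbitrary.

```r
library(dplyr)
library(tidyr)

# Hypothetical function that returns several rows and columns per group
myfun_multi <- function(x, y) {
  data.frame(
    stat = c("mean", "sd"),
    x = c(mean(x), sd(x)),
    y = c(mean(y), sd(y))
  )
}

res_multi <- mtcars %>%
  group_by(cyl) %>%
  # wrap the data frame in list() so each group holds one nested result
  summarise(nested = list(myfun_multi(cyl, disp))) %>%
  # expand the nested data frames back into regular rows and columns
  unnest(nested)

res_multi
```

The function is still called once per group, and unnest expands each nested data frame into as many rows as it contains.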