分组函数(tapply、by、aggregate)和*apply族
每当我想在R中“映射”py时,我通常尝试使用分组函数(tapply、by、aggregate)和*apply族,r,lapply,sapply,tapply,r-faq,R,Lapply,Sapply,Tapply,R Faq,每当我想在R中“映射”py时,我通常尝试使用apply家族中的函数 然而,我从来没有完全理解它们之间的区别--{sapply,lapply,等等}如何将函数应用于输入/分组输入,输出将是什么样子,甚至输入可以是什么--所以我经常只是仔细检查它们,直到得到我想要的 有人能解释一下在什么时候使用哪一个吗 我目前(可能不正确/不完整)的理解是 sapply(vec,f):输入是一个向量。输出是一个向量/矩阵,其中元素i是f(vec[i]),如果f具有多元素输出,则为您提供一个矩阵 lappy(vec,
apply
家族中的函数
然而,我从来没有完全理解它们之间的区别--{sapply
,lapply
,等等}如何将函数应用于输入/分组输入,输出将是什么样子,甚至输入可以是什么--所以我经常只是仔细检查它们,直到得到我想要的
有人能解释一下在什么时候使用哪一个吗
我目前(可能不正确/不完整)的理解是
sapply(vec,f)
:输入是一个向量。输出是一个向量/矩阵,其中元素i
是f(vec[i])
,如果f
具有多元素输出,则为您提供一个矩阵lappy(vec,f)
:与sapply
相同,但输出是列表吗应用(矩阵,1/2,f)
:输入是一个矩阵。输出是一个向量,其中元素i
是f(矩阵的行/列i)taply(vector,grouping,f)
:输出是一个矩阵/数组,其中矩阵/数组中的一个元素是向量的分组g
处的f
,并且g
被推送到行/列名称by(dataframe,grouping,f)
:让g
成为一个分组。将f
应用于组/数据帧的每一列。漂亮地打印分组和每列的f
值aggregate(矩阵,分组,f)
:类似于by
,但aggregate不是漂亮地打印输出,而是将所有内容粘贴到数据帧中旁白:我还没有学会plyr或重塑——将
plyr
或重塑
完全取代所有这些吗?在旁白中,这里是各种plyr
函数如何对应于基础*应用函数(从plyr网页的简介到plyr文档)
plyr
的目标之一是为每个函数提供一致的命名约定,在函数名中编码输入和输出数据类型。它还提供了输出的一致性,因为dlply()
的输出很容易通过ldply()
产生有用的输出,等等
从概念上讲,学习plyr
并不比理解基本*应用功能更困难
plyr
和重塑
功能在我的日常使用中几乎取代了所有这些功能。但是,从Plyr简介文件中也可以看出:
相关功能taply
和sweep
在plyr
中没有相应的功能,仍然有用<代码>合并
用于将摘要与原始数据相结合
R具有许多*应用功能,这些功能在帮助文件中有详细描述(例如,?应用
)。然而,它们已经足够多了,刚开始使用的用户可能很难决定哪一个适合他们的情况,甚至很难记住它们。他们可能有一种普遍的感觉,“我应该在这里使用一个*apply函数”,但一开始很难让他们都明白
尽管非常流行的plyr
软件包涵盖了*apply系列的许多功能(在其他答案中指出),但基本功能仍然很有用,值得了解
此答案旨在作为新用户的一种路标,帮助他们针对特定问题使用正确的*apply功能。注意,这不是简单地反刍或替换R文档!希望这个答案能帮助您决定哪一个应用函数适合您的情况,然后由您进一步研究。除了一个例外,性能差异将不会得到解决
- 应用-当您要将函数应用于行或列时
矩阵(和更高维的类似物);通常不建议使用数据帧,因为它将首先强制使用矩阵
- sapply-当您要将函数应用于
依次列出,但您希望返回一个向量,而不是列表
如果发现自己键入<代码> UNLIST(LApple(…))< /代码>,停止并考虑
sapply
x <- list(a = 1, b = 1:3, c = 10:100)
# Compare with above; a named vector, not a list
sapply(x, FUN = length)
a b c
1 3 91
sapply(x, FUN = sum)
a b c
1 6 5005
如果我们的函数返回一个二维矩阵,sapply
将执行基本相同的操作,将每个返回的矩阵视为单个长向量:
sapply(1:5,function(x) matrix(x,2,2))
除非我们指定simplify=“array”
,在这种情况下,它将使用单个矩阵构建多维数组:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
当然,这些行为中的每一个都取决于我们返回相同长度或维度的向量或矩阵的函数
- vapply-当您想使用
sapply
但可能需要
从你的代码中挤出更多的速度
对于vapply
,您基本上给R一个例子,说明什么样的事情
您的函数将返回,这可以节省一些强制返回的时间
值以适合单个原子向量
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
- Map-使用
SIMPLIFY=FALSE
对mappy
进行包装,因此保证返回列表
- rapply-当您希望递归地将函数应用于嵌套列表结构的每个元素时
为了让你了解一下rappy的不寻常之处,我在第一次发布这个答案时就忘了!显然,我相信很多人都在使用它,但是YMMV
rappy
最好用用户定义的应用函数来说明:
# Append ! to string, otherwise increment
myFun <- function(x){
if(is.character(x)){
return(paste(x,"!",sep=""))
}
else{
return(x + 1)
}
}
#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
b = 3, c = "Yikes",
d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
# Result is named vector, coerced to character
rapply(l, myFun)
# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
在定义子组的地方可以处理更复杂的示例
通过几个因素的独特组合tapply
is
精神上类似于拆分ap
sapply(1:5,function(x) matrix(x,2,2))
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15
# Append ! to string, otherwise increment
myFun <- function(x){
if(is.character(x)){
return(paste(x,"!",sep=""))
}
else{
return(x + 1)
}
}
#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
b = 3, c = "Yikes",
d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
# Result is named vector, coerced to character
rapply(l, myFun)
# Result is a nested list like l, with values altered
rapply(l, myFun, how="replace")
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
tapply(x, y, sum)
a b c d e
10 26 42 58 74
dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
means <- tapply(dfr$a, dfr$f, mean)
## A B C D E
## 2.5 6.5 10.5 14.5 18.5
## great, but putting it back in the data frame is another line:
dfr$m <- means[dfr$f]
dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
dfr
## a f m m2
## 1 A 2.5 2.5
## 2 A 2.5 2.5
## 3 A 2.5 2.5
## 4 A 2.5 2.5
## 5 B 6.5 6.5
## 6 B 6.5 6.5
## 7 B 6.5 6.5
## ...
dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
x <- dfr[x,]
sum(x$m*x$m2)
})
dfr
## a f m m2 foo
## 1 1 A 2.5 2.5 25
## 2 2 A 2.5 2.5 25
## 3 3 A 2.5 2.5 25
## ...
ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )
cb
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
--------------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
--------------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
ct
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
bywork <- by(iris, iris$Species, summary )
bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
--------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
--------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
-------------------------------------------
iris$Species: versicolor
[1] NA
-------------------------------------------
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
at
setosa versicolor virginica
5.006 5.936 6.588
ag
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
supp dose len
1 OJ 0.5 13.23
2 VC 0.5 7.98
3 OJ 1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
att
OJ VC
0.5 13.23 7.98
1 22.70 16.77
2 26.06 26.14
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
ag1
Month Ozone Temp
1 5 23.61538 66.73077
2 6 29.44444 78.22222
3 7 59.11538 83.88462
4 8 59.96154 83.96154
5 9 31.44828 76.89655
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
cbind(ta1, ta2)
ta1 ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
library(dplyr)
library(data.table)
set.seed(123)
n = 5e7
k = 5e5
x = runif(n)
grp = sample(k, n, TRUE)
timing = list()
# sapply
timing[["sapply"]] = system.time({
lt = split(x, grp)
r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
})
# lapply
timing[["lapply"]] = system.time({
lt = split(x, grp)
r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
})
# tapply
timing[["tapply"]] = system.time(
r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
)
# by
timing[["by"]] = system.time(
r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# aggregate
timing[["aggregate"]] = system.time(
r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
)
# dplyr
timing[["dplyr"]] = system.time({
df = data_frame(x, grp)
r.dplyr = summarise(group_by(df, grp), sum(x), n())
})
# data.table
timing[["data.table"]] = system.time({
dt = setnames(setDT(list(x, grp)), c("x","grp"))
r.data.table = dt[, .(sum(x), .N), grp]
})
# all output size match to group count
sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table),
function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
# sapply lapply tapply by aggregate dplyr data.table
# TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# print timings
as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
)[,.(fun = V1, elapsed = V2)
][order(-elapsed)]
# fun elapsed
#1: aggregate 109.139
#2: by 25.738
#3: dplyr 18.978
#4: tapply 17.006
#5: lapply 11.524
#6: sapply 11.326
#7: data.table 2.686
The outer product of the arrays X and Y is the array A with dimension
c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =
FUN(X[arrayindex.x], Y[arrayindex.y], ...).
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
mapply(FUN=pmax, A, B)
> mapply(FUN=pmax, A, B)
[1] 1 3 6 9 12
outer(A,B, pmax)
> outer(A,B, pmax)
[,1] [,2] [,3] [,4] [,5]
[1,] 1 3 6 9 12
[2,] 3 3 6 9 12
[3,] 5 5 6 9 12
[4,] 7 7 7 9 12
[5,] 9 9 9 9 12
A<-c(1,3,5,7,9)
B<-c(0,3,6,9,12)
C<-list(x=1, y=2)
D<-function(x){x+1}
> eapply(.GlobalEnv, is.function)
$A
[1] FALSE
$B
[1] FALSE
$C
[1] FALSE
$D
[1] TRUE
dataPoints <- matrix(4:15, nrow = 4)
# Find means per column with `apply()`
dataPoints_means <- apply(dataPoints, 2, mean)
# Find standard deviation with `apply()`
dataPoints_sdev <- apply(dataPoints, 2, sd)
# Center the points
dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")
# Return the result
dataPoints_Trans1
## [,1] [,2] [,3]
## [1,] -1.5 -1.5 -1.5
## [2,] -0.5 -0.5 -0.5
## [3,] 0.5 0.5 0.5
## [4,] 1.5 1.5 1.5
# Normalize
dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")
# Return the result
dataPoints_Trans2
## [,1] [,2] [,3]
## [1,] -1.1618950 -1.1618950 -1.1618950
## [2,] -0.3872983 -0.3872983 -0.3872983
## [3,] 0.3872983 0.3872983 0.3872983
## [4,] 1.1618950 1.1618950 1.1618950
dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame"), drop = TRUE)
# Apply to columns:
dapply(mtcars, log)
dapply(mtcars, sum)
dapply(mtcars, quantile)
# Apply to rows:
dapply(mtcars, sum, MARGIN = 1)
dapply(mtcars, quantile, MARGIN = 1)
# Return as matrix:
dapply(mtcars, quantile, return = "matrix")
dapply(mtcars, quantile, MARGIN = 1, return = "matrix")
# Same for matrices ...
BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
return = c("same", "matrix", "data.frame", "list"))
# Vectors:
BY(iris$Sepal.Length, iris$Species, sum)
BY(iris$Sepal.Length, iris$Species, quantile)
BY(iris$Sepal.Length, iris$Species, quantile, expand.wide = TRUE) # This returns a matrix
# Data.frames
BY(iris[-5], iris$Species, sum)
BY(iris[-5], iris$Species, quantile)
BY(iris[-5], iris$Species, quantile, expand.wide = TRUE) # This returns a wider data.frame
BY(iris[-5], iris$Species, quantile, return = "matrix") # This returns a matrix
# Same for matrices ...
fFUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)
v <- iris$Sepal.Length
f <- iris$Species
# Vectors
fmean(v) # mean
fmean(v, f) # grouped mean
fsd(v, f) # grouped standard deviation
fsd(v, f, TRA = "/") # grouped scaling
fscale(v, f) # grouped standardizing (scaling and centering)
fwithin(v, f) # grouped demeaning
w <- abs(rnorm(nrow(iris)))
fmean(v, w = w) # Weighted mean
fmean(v, f, w) # Weighted grouped mean
fsd(v, f, w) # Weighted grouped standard-deviation
fsd(v, f, w, "/") # Weighted grouped scaling
fscale(v, f, w) # Weighted grouped standardizing
fwithin(v, f, w) # Weighted grouped demeaning
# Same using data.frames...
fmean(iris[-5], f) # grouped mean
fscale(iris[-5], f) # grouped standardizing
fwithin(iris[-5], f) # grouped demeaning
# Same with matrices ...