分组函数(tapply、by、aggregate)和*apply族

分组函数(tapply、by、aggregate)和*apply族,r,lapply,sapply,tapply,r-faq,R,Lapply,Sapply,Tapply,R Faq,每当我想在R中“映射”py时,我通常尝试使用apply家族中的函数 然而,我从来没有完全理解它们之间的区别--{sapply,lapply,等等}如何将函数应用于输入/分组输入,输出将是什么样子,甚至输入可以是什么--所以我经常只是仔细检查它们,直到得到我想要的 有人能解释一下在什么时候使用哪一个吗 我目前(可能不正确/不完整)的理解是 sapply(vec,f):输入是一个向量。输出是一个向量/矩阵,其中元素i是f(vec[i]),如果f具有多元素输出,则为您提供一个矩阵 lappy(vec,

每当我想在R中“映射”py时,我通常尝试使用
apply
家族中的函数

然而,我从来没有完全理解它们之间的区别--{
sapply
lapply
,等等}如何将函数应用于输入/分组输入,输出将是什么样子,甚至输入可以是什么--所以我经常只是仔细检查它们,直到得到我想要的

有人能解释一下在什么时候使用哪一个吗

我目前(可能不正确/不完整)的理解是

  • sapply(vec,f)
    :输入是一个向量。输出是一个向量/矩阵,其中元素
    i
    f(vec[i])
    ,如果
    f
    具有多元素输出,则为您提供一个矩阵

  • lappy(vec,f)
    :与
    sapply
    相同,但输出是列表吗

  • 应用(矩阵,1/2,f)
    :输入是一个矩阵。输出是一个向量,其中元素
    i
    是f(矩阵的行/列i)
  • taply(vector,grouping,f)
    :输出是一个矩阵/数组,其中矩阵/数组中的一个元素是向量的分组
    g
    处的
    f
    ,并且
    g
    被推送到行/列名称
  • by(dataframe,grouping,f)
    :让
    g
    成为一个分组。将
    f
    应用于组/数据帧的每一列。漂亮地打印分组和每列的
    f
  • aggregate(矩阵,分组,f)
    :类似于
    by
    ,但aggregate不是漂亮地打印输出,而是将所有内容粘贴到数据帧中

  • 旁白:我还没有学会plyr或重塑——将
    plyr
    重塑
    完全取代所有这些吗?

    在旁白中,这里是各种
    plyr
    函数如何对应于基础
    *应用
    函数(从plyr网页的简介到plyr文档)

    plyr
    的目标之一是为每个函数提供一致的命名约定,在函数名中编码输入和输出数据类型。它还提供了输出的一致性,因为
    dlply()
    的输出很容易通过
    ldply()
    产生有用的输出,等等

    从概念上讲,学习
    plyr
    并不比理解基本
    *应用
    功能更困难

    plyr
    重塑
    功能在我的日常使用中几乎取代了所有这些功能。但是,从Plyr简介文件中也可以看出:

    相关功能
    taply
    sweep
    plyr
    中没有相应的功能,仍然有用<代码>合并
    用于将摘要与原始数据相结合


    R具有许多*应用功能,这些功能在帮助文件中有详细描述(例如,
    ?应用
    )。然而,它们已经足够多了,刚开始使用的用户可能很难决定哪一个适合他们的情况,甚至很难记住它们。他们可能有一种普遍的感觉,“我应该在这里使用一个*apply函数”,但一开始很难让他们都明白

    尽管非常流行的
    plyr
    软件包涵盖了*apply系列的许多功能(在其他答案中指出),但基本功能仍然很有用,值得了解

    此答案旨在作为新用户的一种路标,帮助他们针对特定问题使用正确的*apply功能。注意,这不是简单地反刍或替换R文档!希望这个答案能帮助您决定哪一个应用函数适合您的情况,然后由您进一步研究。除了一个例外,性能差异将不会得到解决

    • 应用-当您要将函数应用于行或列时 矩阵(和更高维的类似物);通常不建议使用数据帧,因为它将首先强制使用矩阵

    • sapply-当您要将函数应用于 依次列出,但您希望返回一个向量,而不是列表

      如果发现自己键入<代码> UNLIST(LApple(…))< /代码>,停止并考虑
      sapply

       x <- list(a = 1, b = 1:3, c = 10:100)
       # Compare with above; a named vector, not a list 
       sapply(x, FUN = length)  
       a  b  c   
       1  3 91
      
       sapply(x, FUN = sum)   
       a    b    c    
       1    6 5005 
      
      如果我们的函数返回一个二维矩阵,
      sapply
      将执行基本相同的操作,将每个返回的矩阵视为单个长向量:

       sapply(1:5,function(x) matrix(x,2,2))
      
      除非我们指定
      simplify=“array”
      ,在这种情况下,它将使用单个矩阵构建多维数组:

       sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
      
      当然,这些行为中的每一个都取决于我们返回相同长度或维度的向量或矩阵的函数

    • vapply-当您想使用
      sapply
      但可能需要 从你的代码中挤出更多的速度

      对于
      vapply
      ,您基本上给R一个例子,说明什么样的事情 您的函数将返回,这可以节省一些强制返回的时间 值以适合单个原子向量

       x <- list(a = 1, b = 1:3, c = 10:100)
       #Note that since the advantage here is mainly speed, this
       # example is only for illustration. We're telling R that
       # everything returned by length() should be an integer of 
       # length 1. 
       vapply(x, FUN = length, FUN.VALUE = 0L) 
       a  b  c  
       1  3 91
      
    • Map-使用
      SIMPLIFY=FALSE
      mappy
      进行包装,因此保证返回列表

    • rapply-当您希望递归地将函数应用于嵌套列表结构的每个元素时

      为了让你了解一下rappy的不寻常之处,我在第一次发布这个答案时就忘了!显然,我相信很多人都在使用它,但是YMMV
      rappy
      最好用用户定义的应用函数来说明:

       # Append ! to string, otherwise increment
       myFun <- function(x){
           if(is.character(x)){
             return(paste(x,"!",sep=""))
           }
           else{
             return(x + 1)
           }
       }
      
       #A nested list structure
       l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
                 b = 3, c = "Yikes", 
                 d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
      
      
       # Result is named vector, coerced to character          
       rapply(l, myFun)
      
       # Result is a nested list like l, with values altered
       rapply(l, myFun, how="replace")
      
      在定义子组的地方可以处理更复杂的示例 通过几个因素的独特组合
      tapply
      is 精神上类似于拆分ap
       sapply(1:5,function(x) matrix(x,2,2))
      
       sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
      
       x <- list(a = 1, b = 1:3, c = 10:100)
       #Note that since the advantage here is mainly speed, this
       # example is only for illustration. We're telling R that
       # everything returned by length() should be an integer of 
       # length 1. 
       vapply(x, FUN = length, FUN.VALUE = 0L) 
       a  b  c  
       1  3 91
      
       #Sums the 1st elements, the 2nd elements, etc. 
       mapply(sum, 1:5, 1:5, 1:5) 
       [1]  3  6  9 12 15
       #To do rep(1,4), rep(2,3), etc.
       mapply(rep, 1:4, 4:1)   
       [[1]]
       [1] 1 1 1 1
      
       [[2]]
       [1] 2 2 2
      
       [[3]]
       [1] 3 3
      
       [[4]]
       [1] 4
      
       Map(sum, 1:5, 1:5, 1:5)
       [[1]]
       [1] 3
      
       [[2]]
       [1] 6
      
       [[3]]
       [1] 9
      
       [[4]]
       [1] 12
      
       [[5]]
       [1] 15
      
       # Append ! to string, otherwise increment
       myFun <- function(x){
           if(is.character(x)){
             return(paste(x,"!",sep=""))
           }
           else{
             return(x + 1)
           }
       }
      
       #A nested list structure
       l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"), 
                 b = 3, c = "Yikes", 
                 d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
      
      
       # Result is named vector, coerced to character          
       rapply(l, myFun)
      
       # Result is a nested list like l, with values altered
       rapply(l, myFun, how="replace")
      
       x <- 1:20
      
       y <- factor(rep(letters[1:5], each = 4))
      
       tapply(x, y, sum)  
        a  b  c  d  e  
       10 26 42 58 74 
      
      dfr <- data.frame(a=1:20, f=rep(LETTERS[1:5], each=4))
      means <- tapply(dfr$a, dfr$f, mean)
      ##  A    B    C    D    E 
      ## 2.5  6.5 10.5 14.5 18.5 
      
      ## great, but putting it back in the data frame is another line:
      
      dfr$m <- means[dfr$f]
      
      dfr$m2 <- ave(dfr$a, dfr$f, FUN=mean) # NB argument name FUN is needed!
      dfr
      ##   a f    m   m2
      ##   1 A  2.5  2.5
      ##   2 A  2.5  2.5
      ##   3 A  2.5  2.5
      ##   4 A  2.5  2.5
      ##   5 B  6.5  6.5
      ##   6 B  6.5  6.5
      ##   7 B  6.5  6.5
      ##   ...
      
      dfr$foo <- ave(1:nrow(dfr), dfr$f, FUN=function(x) {
          x <- dfr[x,]
          sum(x$m*x$m2)
      })
      dfr
      ##     a f    m   m2    foo
      ## 1   1 A  2.5  2.5    25
      ## 2   2 A  2.5  2.5    25
      ## 3   3 A  2.5  2.5    25
      ## ...
      
      ct <- tapply(iris$Sepal.Width , iris$Species , summary )
      cb <- by(iris$Sepal.Width , iris$Species , summary )
      
       cb
      iris$Species: setosa
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
        2.300   3.200   3.400   3.428   3.675   4.400 
      -------------------------------------------------------------- 
      iris$Species: versicolor
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
        2.000   2.525   2.800   2.770   3.000   3.400 
      -------------------------------------------------------------- 
      iris$Species: virginica
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
        2.200   2.800   3.000   2.974   3.175   3.800 
      
      
      ct
      $setosa
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
        2.300   3.200   3.400   3.428   3.675   4.400 
      
      $versicolor
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
        2.000   2.525   2.800   2.770   3.000   3.400 
      
      $virginica
         Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
        2.200   2.800   3.000   2.974   3.175   3.800 
      
       tapply(iris, iris$Species, summary )
      Error in tapply(iris, iris$Species, summary) : 
        arguments must have same length
      
      bywork <- by(iris, iris$Species, summary )
      
      bywork
      iris$Species: setosa
        Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
       Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100   setosa    :50  
       1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200   versicolor: 0  
       Median :5.000   Median :3.400   Median :1.500   Median :0.200   virginica : 0  
       Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246                  
       3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300                  
       Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600                  
      -------------------------------------------------------------- 
      iris$Species: versicolor
        Sepal.Length    Sepal.Width     Petal.Length   Petal.Width          Species  
       Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000   setosa    : 0  
       1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200   versicolor:50  
       Median :5.900   Median :2.800   Median :4.35   Median :1.300   virginica : 0  
       Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326                  
       3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500                  
       Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800                  
      -------------------------------------------------------------- 
      iris$Species: virginica
        Sepal.Length    Sepal.Width     Petal.Length    Petal.Width          Species  
       Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400   setosa    : 0  
       1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800   versicolor: 0  
       Median :6.500   Median :3.000   Median :5.550   Median :2.000   virginica :50  
       Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026                  
       3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300                  
       Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500     
      
       by(iris, iris$Species, mean)
      iris$Species: setosa
      [1] NA
      ------------------------------------------- 
      iris$Species: versicolor
      [1] NA
      ------------------------------------------- 
      iris$Species: virginica
      [1] NA
      Warning messages:
      1: In mean.default(data[x, , drop = FALSE], ...) :
        argument is not numeric or logical: returning NA
      2: In mean.default(data[x, , drop = FALSE], ...) :
        argument is not numeric or logical: returning NA
      3: In mean.default(data[x, , drop = FALSE], ...) :
        argument is not numeric or logical: returning NA
      
      at <- tapply(iris$Sepal.Length , iris$Species , mean)
      ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
      
       at
          setosa versicolor  virginica 
           5.006      5.936      6.588 
       ag
           Group.1     x
      1     setosa 5.006
      2 versicolor 5.936
      3  virginica 6.588
      
      ag <- aggregate(len ~ ., data = ToothGrowth, mean)
      
       ag
        supp dose   len
      1   OJ  0.5 13.23
      2   VC  0.5  7.98
      3   OJ  1.0 22.70
      4   VC  1.0 16.77
      5   OJ  2.0 26.06
      6   VC  2.0 26.14
      
      att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
      
       att
             OJ    VC
      0.5 13.23  7.98
      1   22.70 16.77
      2   26.06 26.14
      
       ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
      
       ag1
        Month    Ozone     Temp
      1     5 23.61538 66.73077
      2     6 29.44444 78.22222
      3     7 59.11538 83.88462
      4     8 59.96154 83.96154
      5     9 31.44828 76.89655
      
      ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
      ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
      
       cbind(ta1, ta2)
             ta1      ta2
      5 23.61538 65.54839
      6 29.44444 79.10000
      7 59.11538 83.90323
      8 59.96154 83.96774
      9 31.44828 76.90000
      
      by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
      
      byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
      aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
      
      library(dplyr)
      library(data.table)
      set.seed(123)
      n = 5e7
      k = 5e5
      x = runif(n)
      grp = sample(k, n, TRUE)
      
      timing = list()
      
      # sapply
      timing[["sapply"]] = system.time({
          lt = split(x, grp)
          r.sapply = sapply(lt, function(x) list(sum(x), length(x)), simplify = FALSE)
      })
      
      # lapply
      timing[["lapply"]] = system.time({
          lt = split(x, grp)
          r.lapply = lapply(lt, function(x) list(sum(x), length(x)))
      })
      
      # tapply
      timing[["tapply"]] = system.time(
          r.tapply <- tapply(x, list(grp), function(x) list(sum(x), length(x)))
      )
      
      # by
      timing[["by"]] = system.time(
          r.by <- by(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
      )
      
      # aggregate
      timing[["aggregate"]] = system.time(
          r.aggregate <- aggregate(x, list(grp), function(x) list(sum(x), length(x)), simplify = FALSE)
      )
      
      # dplyr
      timing[["dplyr"]] = system.time({
          df = data_frame(x, grp)
          r.dplyr = summarise(group_by(df, grp), sum(x), n())
      })
      
      # data.table
      timing[["data.table"]] = system.time({
          dt = setnames(setDT(list(x, grp)), c("x","grp"))
          r.data.table = dt[, .(sum(x), .N), grp]
      })
      
      # all output size match to group count
      sapply(list(sapply=r.sapply, lapply=r.lapply, tapply=r.tapply, by=r.by, aggregate=r.aggregate, dplyr=r.dplyr, data.table=r.data.table), 
             function(x) (if(is.data.frame(x)) nrow else length)(x)==k)
      #    sapply     lapply     tapply         by  aggregate      dplyr data.table 
      #      TRUE       TRUE       TRUE       TRUE       TRUE       TRUE       TRUE 
      
      # print timings
      as.data.table(sapply(timing, `[[`, "elapsed"), keep.rownames = TRUE
                    )[,.(fun = V1, elapsed = V2)
                      ][order(-elapsed)]
      #          fun elapsed
      #1:  aggregate 109.139
      #2:         by  25.738
      #3:      dplyr  18.978
      #4:     tapply  17.006
      #5:     lapply  11.524
      #6:     sapply  11.326
      #7: data.table   2.686
      
      The outer product of the arrays X and Y is the array A with dimension  
      c(dim(X), dim(Y)) where element A[c(arrayindex.x, arrayindex.y)] =   
      FUN(X[arrayindex.x], Y[arrayindex.y], ...).
      
       A<-c(1,3,5,7,9)
       B<-c(0,3,6,9,12)
      
      mapply(FUN=pmax, A, B)
      
      > mapply(FUN=pmax, A, B)
      [1]  1  3  6  9 12
      
      outer(A,B, pmax)
      
       > outer(A,B, pmax)
            [,1] [,2] [,3] [,4] [,5]
       [1,]    1    3    6    9   12
       [2,]    3    3    6    9   12
       [3,]    5    5    6    9   12
       [4,]    7    7    7    9   12
       [5,]    9    9    9    9   12
      
      A<-c(1,3,5,7,9)
      B<-c(0,3,6,9,12)
      C<-list(x=1, y=2)
      D<-function(x){x+1}
      
      > eapply(.GlobalEnv, is.function)
      $A
      [1] FALSE
      
      $B
      [1] FALSE
      
      $C
      [1] FALSE
      
      $D
      [1] TRUE 
      
      dataPoints <- matrix(4:15, nrow = 4)
      
      # Find means per column with `apply()`
      dataPoints_means <- apply(dataPoints, 2, mean)
      
      # Find standard deviation with `apply()`
      dataPoints_sdev <- apply(dataPoints, 2, sd)
      
      # Center the points 
      dataPoints_Trans1 <- sweep(dataPoints, 2, dataPoints_means,"-")
      
      # Return the result
      dataPoints_Trans1
      ##      [,1] [,2] [,3]
      ## [1,] -1.5 -1.5 -1.5
      ## [2,] -0.5 -0.5 -0.5
      ## [3,]  0.5  0.5  0.5
      ## [4,]  1.5  1.5  1.5
      
      # Normalize
      dataPoints_Trans2 <- sweep(dataPoints_Trans1, 2, dataPoints_sdev, "/")
      
      # Return the result
      dataPoints_Trans2
      ##            [,1]       [,2]       [,3]
      ## [1,] -1.1618950 -1.1618950 -1.1618950
      ## [2,] -0.3872983 -0.3872983 -0.3872983
      ## [3,]  0.3872983  0.3872983  0.3872983
      ## [4,]  1.1618950  1.1618950  1.1618950
      
      dapply(X, FUN, ..., MARGIN = 2, parallel = FALSE, mc.cores = 1L, 
             return = c("same", "matrix", "data.frame"), drop = TRUE)
      
      # Apply to columns:
      dapply(mtcars, log)
      dapply(mtcars, sum)
      dapply(mtcars, quantile)
      # Apply to rows:
      dapply(mtcars, sum, MARGIN = 1)
      dapply(mtcars, quantile, MARGIN = 1)
      # Return as matrix:
      dapply(mtcars, quantile, return = "matrix")
      dapply(mtcars, quantile, MARGIN = 1, return = "matrix")
      # Same for matrices ...
      
      BY(X, g, FUN, ..., use.g.names = TRUE, sort = TRUE,
         expand.wide = FALSE, parallel = FALSE, mc.cores = 1L,
         return = c("same", "matrix", "data.frame", "list"))
      
      # Vectors:
      BY(iris$Sepal.Length, iris$Species, sum)
      BY(iris$Sepal.Length, iris$Species, quantile)
      BY(iris$Sepal.Length, iris$Species, quantile, expand.wide = TRUE) # This returns a matrix 
      # Data.frames
      BY(iris[-5], iris$Species, sum)
      BY(iris[-5], iris$Species, quantile)
      BY(iris[-5], iris$Species, quantile, expand.wide = TRUE) # This returns a wider data.frame
      BY(iris[-5], iris$Species, quantile, return = "matrix") # This returns a matrix
      # Same for matrices ...
      
      fFUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,] use.g.names = TRUE, drop = TRUE)
      
      v <- iris$Sepal.Length
      f <- iris$Species
      
      # Vectors
      fmean(v)             # mean
      fmean(v, f)          # grouped mean
      fsd(v, f)            # grouped standard deviation
      fsd(v, f, TRA = "/") # grouped scaling
      fscale(v, f)         # grouped standardizing (scaling and centering)
      fwithin(v, f)        # grouped demeaning
      
      w <- abs(rnorm(nrow(iris)))
      fmean(v, w = w)      # Weighted mean
      fmean(v, f, w)       # Weighted grouped mean
      fsd(v, f, w)         # Weighted grouped standard-deviation
      fsd(v, f, w, "/")    # Weighted grouped scaling
      fscale(v, f, w)      # Weighted grouped standardizing
      fwithin(v, f, w)     # Weighted grouped demeaning
      
      # Same using data.frames...
      fmean(iris[-5], f)                # grouped mean
      fscale(iris[-5], f)               # grouped standardizing
      fwithin(iris[-5], f)              # grouped demeaning
      
      # Same with matrices ...