R: Sum multiple columns by group with tapply


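I wanted to sum individual columns by group, and my first thought was to use tapply. However, I cannot get tapply to work. Can tapply be used to sum multiple columns? If not, why not?

I have searched the internet extensively and found numerous similar questions dating back to 2008. However, none of those questions was answered directly. Instead, the responses invariably suggest using a different function.

Below is an example data set for which I wish to sum apples by state, cherries by state and plums by state. Below that, I have compiled numerous alternatives to tapply that do work.

At the bottom, I show a simple modification to the tapply source code that allows tapply to perform the desired operation.

Nevertheless, perhaps I am overlooking a simple way to perform the desired operation with tapply. I am not looking for alternative functions, although additional alternatives are welcome.

Given the simplicity of my modification to the tapply source code, I wonder why it, or something similar, has not already been implemented.

Thank you for any advice. If my question is a duplicate, I will be happy to post my answer as an answer to that other question.

Here is the example data set: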

df.1 <- read.table(text = '

    state   county   apples   cherries   plums
       AA        1        1          2       3
       AA        2       10         20      30
       AA        3      100        200     300
       BB        7       -1         -2      -3
       BB        8      -10        -20     -30
       BB        9     -100       -200    -300

', header = TRUE, stringsAsFactors = FALSE)

The help page says:

tapply(X, INDEX, FUN = NULL, ..., simplify = TRUE)

X       an atomic object, typically a vector.

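I was confused by the phrase "typically a vector", which made me wonder whether a data frame could be used. I have never been clear on what "atomic object" means.

Here are several alternatives to tapply that do work. The first alternative is a workaround that combines tapply with apply: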
apply(df.1[,c(3:5)], 2, function(x) tapply(x, df.1$state, sum))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

with(df.1, aggregate(df.1[,3:5], data.frame(state), sum))

#   state apples cherries plums
# 1    AA    111      222   333
# 2    BB   -111     -222  -333

t(sapply(split(df.1[,3:5], df.1$state), colSums))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

t(sapply(split(df.1[,3:5], df.1$state), function(x) apply(x, 2, sum)))

#    apples cherries plums
# AA    111      222   333
# BB   -111     -222  -333

aggregate(df.1[,3:5], by=list(df.1$state), sum)

#   Group.1 apples cherries plums
# 1      AA    111      222   333
# 2      BB   -111     -222  -333

by(df.1[,3:5], df.1$state, colSums)

# df.1$state: AA
#   apples cherries    plums 
#      111      222      333 
# ------------------------------------------------------------ 
# df.1$state: BB
#   apples cherries    plums 
#     -111     -222     -333

with(df.1, 
     aggregate(x = list(apples   = apples, 
                        cherries = cherries,
                        plums    = plums), 
               by = list(state   = state), 
               FUN = function(x) sum(x)))

#   state apples cherries plums
# 1    AA    111      222   333
# 2    BB   -111     -222  -333

lapply(split(df.1, df.1$state), function(x) {colSums(x[,3:5])} )

# $AA
#   apples cherries    plums 
#      111      222      333 
#
# $BB
#   apples cherries    plums 
#     -111     -222     -333

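Here is the source code of tapply, except that I changed the line nx <- length(X) to nx <- ifelse(is.vector(X), length(X), dim(X)[1]). The modified function, my.tapply, is listed further down this page.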

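tapply works on a vector. For a data.frame you can use by (which is a wrapper for tapply; take a look at its code), for example by(df.1[,c(3:5)], df.1$state, FUN=colSums).

You are looking for by. It uses INDEX the way you assumed tapply would, that is, by row: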
by(df.1, df.1$state, function(x) colSums(x[,3:5]))

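The problem with your use of tapply is that you were indexing the data.frame by column (a data.frame is really just a list of columns). So tapply complained that your index did not match the length of your data.frame, which is 5.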
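The length mismatch described above can be checked directly. The following is a minimal sketch that rebuilds df.1 from the question and compares what length() reports for the data frame and for the index. It deliberately does not call tapply on the data frame, since whether that call is rejected may depend on the R version in use.

```r
# Rebuild the example data set from the question.
df.1 <- read.table(text = '
    state   county   apples   cherries   plums
       AA        1        1          2       3
       AA        2       10         20      30
       AA        3      100        200     300
       BB        7       -1         -2      -3
       BB        8      -10        -20     -30
       BB        9     -100       -200    -300
', header = TRUE, stringsAsFactors = FALSE)

is.atomic(df.1$apples)  # TRUE: a single column is an atomic vector
is.atomic(df.1)         # FALSE: a data frame is a list of columns
is.list(df.1)           # TRUE

length(df.1)            # 5, the number of columns
nrow(df.1)              # 6, the number of rows
length(df.1$state)      # 6, so the index lines up with rows, not with length(df.1)
```

This is also what "atomic object" in the help page refers to: a plain vector such as df.1$apples qualifies, the data frame as a whole does not.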
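I looked at the source code of by, as EDi suggested. That code is considerably more complex than my one-line change to tapply. I have since found that my.tapply does not work for the more complex scenario below, in which apples and cherries are summed by state and county. If I get my.tapply to work for this case, I can post the code here later: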
df.2 <- read.table(text = '

    state   county   apples   cherries   plums
       AA        1        1          2       3
       AA        1        1          2       3
       AA        2       10         20      30
       AA        2       10         20      30
       AA        3      100        200     300
       AA        3      100        200     300

       BB        7       -1         -2      -3
       BB        7       -1         -2      -3
       BB        8      -10        -20     -30
       BB        8      -10        -20     -30
       BB        9     -100       -200    -300
       BB        9     -100       -200    -300

', header = TRUE, stringsAsFactors = FALSE)

# my function works

   tapply(df.2$apples  , list(df.2$state, df.2$county), function(x) {sum(x)})
my.tapply(df.2$apples  , list(df.2$state, df.2$county), function(x) {sum(x)})

# my function works

   tapply(df.2$cherries, list(df.2$state, df.2$county), function(x) {sum(x)})
my.tapply(df.2$cherries, list(df.2$state, df.2$county), function(x) {sum(x)})

# my function does not work

my.tapply(df.2[,3:4], list(df.2$state, df.2$county), function(x) {colSums(x)})
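For the state-and-county case, two approaches that stay within base R are sketched below: calling tapply once per column, in the spirit of the apply workaround in the question, and aggregate with a formula. This is an illustrative sketch rather than a fix for my.tapply; the names per.column and agg are made up for the example.

```r
# Rebuild the second example data set from the question.
df.2 <- read.table(text = '
    state   county   apples   cherries   plums
       AA        1        1          2       3
       AA        1        1          2       3
       AA        2       10         20      30
       AA        2       10         20      30
       AA        3      100        200     300
       AA        3      100        200     300
       BB        7       -1         -2      -3
       BB        7       -1         -2      -3
       BB        8      -10        -20     -30
       BB        8      -10        -20     -30
       BB        9     -100       -200    -300
       BB        9     -100       -200    -300
', header = TRUE, stringsAsFactors = FALSE)

# One tapply call per column. Each result is a state x county matrix,
# with NA for state/county combinations that do not occur in the data.
per.column <- lapply(df.2[, c("apples", "cherries")],
                     function(x) tapply(x, list(df.2$state, df.2$county), sum))
per.column$apples
per.column$cherries

# aggregate() returns one row per state/county combination that occurs.
agg <- aggregate(cbind(apples, cherries) ~ state + county,
                 data = df.2, FUN = sum)
agg
```

The per-column version keeps tapply in charge of the grouping and only loops over columns; the aggregate version avoids the empty cells at the cost of returning a long data frame instead of matrices.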

The modified function, my.tapply, followed by two calls on df.1:
my.tapply <- function (X, INDEX, FUN = NULL, ..., simplify = TRUE)
{
    FUN <- if (!is.null(FUN)) match.fun(FUN)
    if (!is.list(INDEX)) INDEX <- list(INDEX)
    nI <- length(INDEX)
    if (!nI) stop("'INDEX' is of length zero")
    namelist <- vector("list", nI)
    names(namelist) <- names(INDEX)
    extent <- integer(nI)
    nx     <- ifelse(is.vector(X), length(X), dim(X)[1])  # replaces nx <- length(X)
    one <- 1L
    group <- rep.int(one, nx) #- to contain the splitting vector
    ngroup <- one
    for (i in seq_along(INDEX)) {
    index <- as.factor(INDEX[[i]])
    if (length(index) != nx)
        stop("arguments must have same length")
    namelist[[i]] <- levels(index)#- all of them, yes !
    extent[i] <- nlevels(index)
    group <- group + ngroup * (as.integer(index) - one)
    ngroup <- ngroup * nlevels(index)
    }
    if (is.null(FUN)) return(group)
    ans <- lapply(X = split(X, group), FUN = FUN, ...)
    index <- as.integer(names(ans))
    if (simplify && all(unlist(lapply(ans, length)) == 1L)) {
    ansmat <- array(dim = extent, dimnames = namelist)
    ans <- unlist(ans, recursive = FALSE)
    } else {
    ansmat <- array(vector("list", prod(extent)),
            dim = extent, dimnames = namelist)
    }
    if(length(index)) {
        names(ans) <- NULL
        ansmat[index] <- ans
    }
    ansmat
}

my.tapply(df.1$apples, df.1$state, function(x) {sum(x)})

#  AA   BB 
# 111 -111

my.tapply(df.1[,3:4] , df.1$state, function(x) {colSums(x)})

# $AA
#   apples cherries 
#      111      222 
#
# $BB
#   apples cherries 
#     -111     -222

> by(df.1[,c(3:5)], df.1$state, FUN=colSums)
df.1$state: AA
  apples cherries    plums 
     111      222      333 
------------------------------------------------------------------------------------- 
df.1$state: BB
  apples cherries    plums 
    -111     -222     -333 
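If a single matrix is preferred over the printed by object, the per-group results can be stacked with rbind. A small sketch, assuming df.1 as defined in the question; the name sums is only for illustration.

```r
# Rebuild the example data set from the question.
df.1 <- read.table(text = '
    state   county   apples   cherries   plums
       AA        1        1          2       3
       AA        2       10         20      30
       AA        3      100        200     300
       BB        7       -1         -2      -3
       BB        8      -10        -20     -30
       BB        9     -100       -200    -300
', header = TRUE, stringsAsFactors = FALSE)

# by() yields one named vector of column sums per state;
# do.call(rbind, ...) binds those vectors into the rows of a matrix,
# in the order of the state levels (AA, then BB).
sums <- do.call(rbind, by(df.1[, 3:5], df.1$state, colSums))
sums
```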
