R: Aggregate based on factor or numeric


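I'm trying to aggregate some data that contains both numeric and factor variables. If a variable is numeric, I want the mean. If it is a factor, I want the most frequently occurring value. I've written the following function, but I'm not getting the output I want: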

meanOrMostFreq <- function(x){
    if(class(x) == 'factor'){
    tbl <- as.data.frame(table(x))
    tbl$Var1 <- as.character(tbl$Var1)
    return(tbl[tbl$Freq == max(tbl$Freq),'Var1'][1])
    }
    if(class(x) == 'numeric'){
    meanX <- mean(x, na.rm = TRUE)
    return(meanX)
    }
}

I'd like to get an actual letter in the last column rather than a number. Any suggestions on what I'm doing wrong?

Here is one approach using data.table:

library(data.table)
setDT(df1)[ ,lapply(.SD, function(x) if(is.numeric(x)) mean(x, na.rm=TRUE) else
          names(which.max(table(x)))) , by=Species]

#         Species Sepal.Length Sepal.Width Petal.Length Petal.Width letter1
#1:     setosa     5.006000    3.428000     1.462000       0.246       a
#2: versicolor     5.936000    2.770000     4.260000       1.326       c
#3:  virginica     6.610417    2.964583     5.564583       2.025       a

Aggregating through the formula interface apparently loses the metadata that marks the column as a factor; this works for me:

> meanOrMostFreq
function(x){
    if(class(x) == 'factor'){
    return(  names(which.max(table(x))) )
    }
    if(class(x) == 'numeric'){
    meanX <- mean(x, na.rm = TRUE)
    return(meanX)
    }
}
> aggregate(df1[-5], df1[5], meanOrMostFreq)
     Species Sepal.Length Sepal.Width Petal.Length Petal.Width letter1
1     setosa     5.006000    3.428000     1.462000       0.246       a
2 versicolor     5.936000    2.770000     4.260000       1.326       c
3  virginica     6.610417    2.964583     5.564583       2.025       a
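
For anyone who wants to try this without the original df1, here is a minimal sketch on made-up data. The data frame toy and the helper mean_or_mode are hypothetical names of mine, not from the question. It also swaps class(x) == 'numeric' for is.numeric(x), on the assumption that integer columns should be averaged too.

```r
# Hypothetical toy data: one grouping column, one numeric, one factor.
toy <- data.frame(
  group  = c("g1", "g1", "g1", "g2", "g2"),
  value  = c(1, 2, 3, 10, 20),
  letter = factor(c("a", "b", "b", "c", "c"))
)

# Mean for numeric input, most frequent level (as a string) otherwise.
mean_or_mode <- function(x) {
  if (is.numeric(x)) {
    mean(x, na.rm = TRUE)
  } else {
    names(which.max(table(x)))
  }
}

# Data-frame interface: each column reaches FUN with its class intact.
res <- aggregate(toy[c("value", "letter")], toy["group"], mean_or_mode)
print(res)
```

With ties, which.max() returns the first level that reaches the maximum count, so the result depends on the level order.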


Since aggregate.formula and aggregate.data.frame behave differently here, this looks like a bug to me.
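
A likely explanation, and this is my assumption about the internals rather than something established in this thread, is that the formula method assembles a multi-column left-hand side with cbind(). The sketch below only shows what cbind() does to a factor; it does not call aggregate() itself.

```r
# cbind() on a numeric vector and a factor gives a numeric matrix,
# so the factor survives only as its internal integer codes.
f <- factor(c("b", "a", "b"))
m <- cbind(value = c(1.5, 2.5, 3.5), letter = f)

is.numeric(m)     # the whole matrix is numeric
m[, "letter"]     # level codes, not the labels "b", "a", "b"

# A data frame keeps each column's class, which is what the
# aggregate(df1[-5], df1[5], ...) form relies on.
d <- data.frame(value = c(1.5, 2.5, 3.5), letter = f)
is.factor(d$letter)
```

If that is what happens, FUN never sees a factor in the formula case and always takes the numeric branch, which would explain a number in the last column.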

An alternative using the plyr package:

library(plyr)
# Apply meanOrMostFreq to every column within each Species group
ddply(df1, .(Species), function(df) {
    sapply(df, meanOrMostFreq)
})


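Pretty sure aggregate won't work with non-numerics. You probably just need a different tool. You can easily get what you want with data.table. Anyway, there are some mistakes in your meanOrMostFreq. First, it should be as.data.frame(table(x)). Then the resulting column will be named x, not Var. You don't see these errors in the aggregate call because it coerces to numeric. Just try meanOrMostFreq(df1$letters)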
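
The column-naming point is easy to check on a small factor. This is a minimal sketch on made-up values; x, tbl and most_frequent are illustrative names only.

```r
x <- factor(c("a", "c", "c", "b"))
tbl <- as.data.frame(table(x))

names(tbl)          # "x" "Freq": the first column is named after the argument
is.null(tbl$Var1)   # TRUE: there is no Var1 column to convert or return

# Picking the most frequent level by its actual column name
most_frequent <- as.character(tbl$x[which.max(tbl$Freq)])
most_frequent
```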

@nicola: You're right about table(x) and about var vs. Var1. I've made those changes, but I still get the same error.
