How to sum a variable by group in R


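I have a data frame with two columns. The first column contains categories such as "First", "Second", "Third", and the second column has numbers that represent how many times I saw each specific group from "Category".

For example: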

Category     Frequency
First        10
First        15
First        5
Second       2
Third        14
Third        20
Second       3
I want to sort the data by Category and sum all the Frequencies:

Category     Frequency
First        30
Second       5
Third        34

How can I do this in R?

If x is a data frame containing your data, then the following will do what you want:

require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)

Just to add a third option:

require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)

Edit: this is a very old answer. I would now recommend group_by and summarise from dplyr, as described in @docendo's answer.

Using aggregate:

aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
  Category  x
1    First 30
2   Second  5
3    Third 34

In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be combined via cbind:

aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
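As a minimal sketch of both points, here is a complete call with two grouping dimensions and two metrics. The Region and Metric2 columns are hypothetical and are added only for illustration; they are not part of the data in the question.

```r
# Example data; Region and Metric2 are hypothetical columns added for illustration
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Region    = c("N", "S", "N", "N", "S", "S", "N"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)
)

# Two grouping dimensions go in the `by` list; two metrics are combined with cbind.
# Naming the cbind arguments gives the output columns readable names.
res <- aggregate(cbind(Frequency = x$Frequency, Metric2 = x$Metric2),
                 by = list(Category = x$Category, Region = x$Region),
                 FUN = sum)
res
```

Each row of res is one Category/Region combination that occurs in the data, with both metrics summed within it.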

(embedding @thelatemail's comment) aggregate also has a formula interface:

aggregate(Frequency ~ Category, x, sum)
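Or, if you want to aggregate multiple columns, you can use the dot (.) notation (it works for one column too):

aggregate(. ~ Category, x, sum)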


Or with tapply:

tapply(x$Frequency, x$Category, FUN=sum)
 First Second  Third 
    30      5     34 

Using this data:

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                      "Third", "Third", "Second")), 
                    Frequency=c(10,15,5,2,14,20,3))
You can also use the by() function:

x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))

The answer provided by rcs works and is simple. However, if you are handling larger datasets and need better performance, there is a faster alternative:

library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"), 
                  Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
#    Category V1
# 1:    First 30
# 2:   Second  5
# 3:    Third 34
system.time(data[, sum(Frequency), by = Category] )
# user    system   elapsed 
# 0.008     0.001     0.009 
Let's compare that with the same operation using a data.frame and the aggregate approach above:

data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
                  Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user    system   elapsed 
# 0.008     0.000     0.015 
If you want to keep the column name, use this syntax:

data[,list(Frequency=sum(Frequency)),by=Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
The difference becomes more noticeable with larger datasets, as the code below shows:

data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
                  Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user    system   elapsed 
# 0.055     0.004     0.059 
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000), 
                  Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user    system   elapsed 
# 0.287     0.010     0.296 

For multiple aggregations, you can combine lapply and .SD as follows:

data[, lapply(.SD, sum), by = Category]
#    Category Frequency
# 1:    First        30
# 2:   Second         5
# 3:    Third        34
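With a single numeric column the lapply/.SD form gives the same result as the earlier call. The sketch below shows it on more than one column; Frequency2 is a hypothetical extra column added for illustration, and .SDcols is used to state explicitly which columns are summed.

```r
library(data.table)

# Frequency2 is a hypothetical second numeric column added for illustration
dt <- data.table(
  Category   = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency  = c(10, 15, 5, 2, 14, 20, 3),
  Frequency2 = c(10, 30, 15, 8, 70, 120, 21)
)

# .SDcols restricts .SD to the named columns; lapply applies sum to each of them
res <- dt[, lapply(.SD, sum), by = Category, .SDcols = c("Frequency", "Frequency2")]
res
```

Without .SDcols, .SD holds every column except the grouping columns, so sum would fail if any of them were non-numeric.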

You can also use the dplyr package for this purpose:

library(dplyr)
x %>% 
  group_by(Category) %>% 
  summarise(Frequency = sum(Frequency))

#Source: local data frame [3 x 2]
#
#  Category Frequency
#1    First        30
#2   Second         5
#3    Third        34
Or, for multiple summary columns (this works with one column too):
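A minimal sketch of the multi-column form, assuming dplyr 1.0.0 or later (for across()) and the x data frame from the question, rebuilt here so the block is self-contained:

```r
library(dplyr)

x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# everything() selects all non-grouping columns, so each of them is summed per Category
res <- x %>%
  group_by(Category) %>%
  summarise(across(everything(), sum))
res
```

Because the grouping column is excluded automatically, the same call keeps working if more numeric columns are added to x.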

Here are some more examples of how to summarise data by group with dplyr functions, using the built-in dataset mtcars:

# several summary columns with arbitrary names
mtcars %>% 
  group_by(cyl, gear) %>%                            # multiple group columns
  summarise(max_hp = max(hp), mean_mpg = mean(mpg))  # multiple summary columns

# summarise all columns except grouping columns using "sum" 
mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(everything(), sum))

# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>% 
  group_by(cyl) %>% 
  summarise(across(everything(), list(mean = mean, sum = sum)))

# multiple grouping columns
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise(across(everything(), list(mean = mean, sum = sum)))

# summarise specific variables, not all
mtcars %>% 
  group_by(cyl, gear) %>% 
  summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))

# summarise specific variables (numeric columns except grouping columns)
mtcars %>% 
  group_by(gear) %>% 
  summarise(across(where(is.numeric), list(mean = mean, sum = sum)))

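For more information, including the %>% operator, see the dplyr documentation.

Several years later, xtabs is another simple option: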

xtabs(Frequency ~ Category, df)
# Category
# First Second  Third 
#    30      5     34 
Or, if you want a data frame back:

as.data.frame(xtabs(Frequency ~ Category, df))
#   Category Freq
# 1    First   30
# 2   Second    5
# 3    Third   34

While I have recently become a convert to dplyr for most operations of this kind, the sqldf package is still really nice (and more readable) for some things.

Here is how to do it with sqldf:

library(sqldf)

x <- data.frame(Category=factor(c("First", "First", "First", "Second",
                                  "Third", "Third", "Second")), 
                Frequency=c(10,15,5,2,14,20,3))

sqldf("select 
          Category
          ,sum(Frequency) as Frequency 
       from x 
       group by 
          Category")

##   Category Frequency
## 1    First        30
## 2   Second         5
## 3    Third        34

Use cast instead of recast (note that "Frequency" is now "value").

You can use the function group.sum from the package Rfast:

Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric. R's as.numeric produces NAs.
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category))
# 30 5 34
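I found this very helpful (and very efficient) when you need to apply different aggregation functions to different columns (and you must or want to stick with base R).

For example, given this input: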

DF <-                
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
           Categ2=factor(c('X','Y','X','X','X','Y','Y')),
           Samples=c(1,2,4,3,5,6,7),
           Freq=c(10,30,45,55,80,65,50))

> DF
  Categ1 Categ2 Samples Freq
1      A      X       1   10
2      A      Y       2   30
3      B      X       4   45
4      B      X       3   55
5      A      X       5   80
6      B      Y       6   65
7      A      Y       7   50
The result is:

> DF2
  Categ1 Categ2 GroupTotSamples GroupAvgFreq
1      A      X               6           45
2      A      Y               9           40
3      B      X               7           50
6      B      Y               6           65
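The code that produces DF2 is not shown above. One base R way to get this result is ave() followed by unique(); this is an assumption about what the answer intends, not a reconstruction of it.

```r
DF <- data.frame(Categ1 = factor(c('A','A','B','B','A','B','A')),
                 Categ2 = factor(c('X','Y','X','X','X','Y','Y')),
                 Samples = c(1,2,4,3,5,6,7),
                 Freq = c(10,30,45,55,80,65,50))

# ave() returns a vector as long as its input, holding each row's group statistic,
# so a different function can be used for each column
DF$GroupTotSamples <- ave(DF$Samples, DF$Categ1, DF$Categ2, FUN = sum)
DF$GroupAvgFreq    <- ave(DF$Freq,    DF$Categ1, DF$Categ2, FUN = mean)

# Keep one row per Categ1/Categ2 combination
DF2 <- unique(DF[, c("Categ1", "Categ2", "GroupTotSamples", "GroupAvgFreq")])
DF2
```

unique() keeps the first row of each group, which is why the row names in the result are 1, 2, 3 and 6 rather than 1 to 4.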

Another solution, which returns the sums by group as a matrix or a data frame and is both short and fast:

rowsum(x$Frequency, x$Category)
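With a vector as input, rowsum returns a one-column matrix whose row names are the groups. As a small sketch of the data frame case, the call below sums two columns at once; Frequency2 is a hypothetical extra column added for illustration.

```r
# Frequency2 is a hypothetical second numeric column added for illustration
df <- data.frame(
  Category   = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency  = c(10, 15, 5, 2, 14, 20, 3),
  Frequency2 = c(10, 30, 15, 8, 70, 120, 21)
)

# Passing a data frame of numeric columns sums each column within each group;
# the groups end up as row names of the result
res <- rowsum(df[, c("Frequency", "Frequency2")], group = df$Category)
res
```

If the groups are needed as a regular column, they can be copied out of rownames(res) afterwards.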

Since dplyr 1.0.0, the across() function can be used:

df %>%
 group_by(Category) %>%
 summarise(across(Frequency, sum))

  Category Frequency
  <chr>        <int>
1 First           30
2 Second           5
3 Third           34

If you are interested in multiple variables:

df %>%
 group_by(Category) %>%
 summarise(across(c(Frequency, Frequency2), sum))

  Category Frequency Frequency2
  <chr>        <int>      <int>
1 First           30         55
2 Second           5         29
3 Third           34        190

And selecting variables with select helpers:

df %>%
 group_by(Category) %>%
 summarise(across(starts_with("Freq"), sum))

  Category Frequency Frequency2 Frequency3
  <chr>        <int>      <int>      <dbl>
1 First           30         55        110
2 Second           5         29         58
3 Third           34        190        380

Sample data:

df <- read.table(text = "Category Frequency Frequency2 Frequency3
                 1    First        10         10         20
                 2    First        15         30         60
                 3    First         5         15         30
                 4   Second         2          8         16
                 5    Third        14         70        140
                 6    Third        20        120        240
                 7   Second         3         21         42",
                 header = TRUE,
                 stringsAsFactors = FALSE)

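+1, but 0.296 vs 0.059 isn't particularly impressive. The data size needs to be much bigger than 300k rows, with more than 3 groups, for data.table to shine. We'll try to support more than 2 billion rows, for example, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31.

True. It turns out I don't have that much RAM, and I was simply trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data.

I had 7 million observations; dplyr took 0.3 seconds and aggregate() took 22 seconds to complete the operation. I was going to post it on this topic, but you beat me to it!

There is an even shorter way to write data[, sum(Frequency), by = Category]: you could use .N in place of the sum() function, as in data[, .N, by = Category].

Using .N would be equivalent to sum(Frequency) only if all the values in the Frequency column were equal to 1, because .N counts the number of rows in each aggregated set (.SD). That is not the case here.

How fast is it compared to the data.table and aggregate alternatives presented in other answers?

@asieira Which is fastest, and how big the difference is (or whether it is noticeable at all), will always depend on your data size. Typically, for large data sets (for example, some GB), data.table will most likely be fastest. On smaller data, data.table and dplyr are often close, also depending on the number of groups. Both data.table and dplyr will be quite a lot faster than base functions, however (quite possibly 100-1000 times faster for some operations).

What does "funs" refer to in the second example?

@lauren.marietta You can specify the function(s) you want to apply as a summary inside the funs() argument.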
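The point raised in the comments about .N can be checked directly. The following is a small sketch on the data from the question, assuming the data.table package is installed:

```r
library(data.table)

data <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# .N counts the rows in each group; it ignores the values in Frequency
n_rows <- data[, .N, by = Category]

# sum(Frequency) adds up the values in each group
totals <- data[, .(Frequency = sum(Frequency)), by = Category]

n_rows
totals
```

Here the group "First" has 3 rows but a Frequency total of 30, so the two calls answer different questions.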
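Another option is count() from dplyr (loaded here via tidyverse), using the wt argument to sum Frequency within each Category:

library(tidyverse)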

x <- data.frame(Category= c('First', 'First', 'First', 'Second', 'Third', 'Third', 'Second'), 
           Frequency = c(10, 15, 5, 2, 14, 20, 3))

count(x, Category, wt = Frequency)