How to sum a variable by group in R
I have a data frame with two columns. The first column contains categories such as "First", "Second" and "Third", and the second column contains numbers giving how many times I saw each group from "Category". For example:
Category Frequency
First 10
First 15
First 5
Second 2
Third 14
Third 20
Second 3
I want to sort the data by Category and sum all the Frequencies:
Category Frequency
First 30
Second 5
Third 34
How can I do this in R?
If x is a data frame containing your data, the following will do what you want:
require(reshape)
recast(x, Category ~ ., fun.aggregate=sum)
Just to add a third option:
require(doBy)
summaryBy(Frequency~Category, data=yourdataframe, FUN=sum)
Edit: this is a very old answer. I would now recommend using group_by and summarise from dplyr, as in @docendo's answer.
Using aggregate:
aggregate(x$Frequency, by=list(Category=x$Category), FUN=sum)
Category x
1 First 30
2 Second 5
3 Third 34
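In the example above, multiple dimensions can be specified in the list. Multiple aggregated metrics of the same data type can be combined with cbind: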
aggregate(cbind(x$Frequency, x$Metric2, x$Metric3) ...
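The cbind call above is truncated, so here is a minimal sketch of how a complete call might look. The Metric2 column and its values are hypothetical, added only for illustration; the by list and FUN are the same as in the single-column call.

```r
# Example data from the question, plus a hypothetical second metric.
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3),
  Metric2   = c(1, 2, 3, 4, 5, 6, 7)   # hypothetical column, not in the original data
)

# Naming the cbind() arguments keeps readable column names in the result.
res <- aggregate(cbind(Frequency = x$Frequency, Metric2 = x$Metric2),
                 by = list(Category = x$Category), FUN = sum)
res
```

Each column passed to cbind is summed separately within every Category.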
(embedding @thelatemail's comment) aggregate also has a formula interface:
aggregate(Frequency ~ Category, x, sum)
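Or, if you want to aggregate multiple columns, you can use the . notation in the formula (this also works for one column).
Or with tapply: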
tapply(x$Frequency, x$Category, FUN=sum)
First Second Third
30 5 34
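The call for the . notation mentioned above is not shown, so here is a minimal sketch of how it might look. It assumes a second numeric column, Frequency2, with values borrowed from the sample data given later on this page; a . on the left-hand side of the formula stands for every column that is not used for grouping.

```r
# Question data plus a second numeric column (values taken from the
# sample data that appears further down the page).
x <- data.frame(
  Category   = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency  = c(10, 15, 5, 2, 14, 20, 3),
  Frequency2 = c(10, 30, 15, 8, 70, 120, 21)
)

# "." on the left-hand side means "all columns other than Category".
res <- aggregate(. ~ Category, data = x, FUN = sum)
res
```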
Using this data:
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
You can also use the by() function:
x2 <- by(x$Frequency, x$Category, sum)
do.call(rbind,as.list(x2))
The answer provided by rcs works and is simple. However, if you are handling larger datasets and need a performance boost, there is a faster alternative:
library(data.table)
data = data.table(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
data[, sum(Frequency), by = Category]
# Category V1
# 1: First 30
# 2: Second 5
# 3: Third 34
system.time(data[, sum(Frequency), by = Category] )
# user system elapsed
# 0.008 0.001 0.009
Let's compare that with the same operation using data.frame and the aggregate call above:
data = data.frame(Category=c("First","First","First","Second","Third", "Third", "Second"),
Frequency=c(10,15,5,2,14,20,3))
system.time(aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum))
# user system elapsed
# 0.008 0.000 0.015
If you want to keep the column name, use this syntax:
data[,list(Frequency=sum(Frequency)),by=Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
The difference becomes more noticeable with larger datasets, as the code below demonstrates:
data = data.table(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( data[,sum(Frequency),by=Category] )
# user system elapsed
# 0.055 0.004 0.059
data = data.frame(Category=rep(c("First", "Second", "Third"), 100000),
Frequency=rnorm(100000))
system.time( aggregate(data$Frequency, by=list(Category=data$Category), FUN=sum) )
# user system elapsed
# 0.287 0.010 0.296
For multiple aggregations, you can combine lapply and .SD as follows:
data[, lapply(.SD, sum), by = Category]
# Category Frequency
# 1: First 30
# 2: Second 5
# 3: Third 34
You can also use the dplyr package for this purpose:
library(dplyr)
x %>%
group_by(Category) %>%
summarise(Frequency = sum(Frequency))
#Source: local data frame [3 x 2]
#
# Category Frequency
#1 First 30
#2 Second 5
#3 Third 34
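The same pattern extends to multiple summary columns (and also works with one column).
Here are some more examples of how to summarise data by group with dplyr functions, using the built-in dataset mtcars: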
# several summary columns with arbitrary names
mtcars %>%
group_by(cyl, gear) %>% # multiple group columns
summarise(max_hp = max(hp), mean_mpg = mean(mpg)) # multiple summary columns
# summarise all columns except grouping columns using "sum"
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), sum))
# summarise all columns except grouping columns using "sum" and "mean"
mtcars %>%
group_by(cyl) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
# multiple grouping columns
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(everything(), list(mean = mean, sum = sum)))
# summarise specific variables, not all
mtcars %>%
group_by(cyl, gear) %>%
summarise(across(c(qsec, mpg, wt), list(mean = mean, sum = sum)))
# summarise specific variables (numeric columns except grouping columns)
mtcars %>%
group_by(gear) %>%
summarise(across(where(is.numeric), list(mean = mean, sum = sum)))
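For more information, including on the %>% operator, see the dplyr documentation.
Several years later, here is another simple base R option, xtabs: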
xtabs(Frequency ~ Category, df)
# Category
# First Second Third
# 30 5 34
Or, if you want a data frame back:
as.data.frame(xtabs(Frequency ~ Category, df))
# Category Freq
# 1 First 30
# 2 Second 5
# 3 Third 34
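While I have recently become a convert to dplyr for most operations of this kind, the sqldf package is still really nice (and more readable) for some things.
Here is how this question can be answered with sqldf: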
library(sqldf)
x <- data.frame(Category=factor(c("First", "First", "First", "Second",
"Third", "Third", "Second")),
Frequency=c(10,15,5,2,14,20,3))
sqldf("select
Category
,sum(Frequency) as Frequency
from x
group by
Category")
## Category Frequency
## 1 First 30
## 2 Second 5
## 3 Third 34
Use cast instead of recast (note that 'Frequency' is now 'value').
You can use the function group.sum from the package Rfast:
Category <- Rfast::as_integer(Category,result.sort=FALSE) # convert character to numeric; R's as.numeric produces NAs
result <- Rfast::group.sum(Frequency,Category)
names(result) <- Rfast::Sort(unique(Category))
# 30 5 34
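I have found this very helpful (and very efficient) when you need to apply different aggregation functions to different columns (and you must or want to stick to base R).
For example, given this input: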
DF <-
data.frame(Categ1=factor(c('A','A','B','B','A','B','A')),
Categ2=factor(c('X','Y','X','X','X','Y','Y')),
Samples=c(1,2,4,3,5,6,7),
Freq=c(10,30,45,55,80,65,50))
> DF
Categ1 Categ2 Samples Freq
1 A X 1 10
2 A Y 2 30
3 B X 4 45
4 B X 3 55
5 A X 5 80
6 B Y 6 65
7 A Y 7 50
The result:
> DF2
Categ1 Categ2 GroupTotSamples GroupAvgFreq
1 A X 6 45
2 A Y 9 40
3 B X 7 50
6 B Y 6 65
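The code that turns DF into DF2 is not shown above. As a hedged sketch, one base R route that reproduces this exact output uses ave(), which returns a group-level summary aligned with the original rows. The choice of ave() here is an assumption, not necessarily the method the answer used.

```r
DF <- data.frame(Categ1  = factor(c('A','A','B','B','A','B','A')),
                 Categ2  = factor(c('X','Y','X','X','X','Y','Y')),
                 Samples = c(1,2,4,3,5,6,7),
                 Freq    = c(10,30,45,55,80,65,50))

# ave() returns a vector as long as its input, where each element holds
# the summary of the group that row belongs to.
DF$GroupTotSamples <- ave(DF$Samples, DF$Categ1, DF$Categ2, FUN = sum)
DF$GroupAvgFreq    <- ave(DF$Freq,    DF$Categ1, DF$Categ2, FUN = mean)

# Keep the first row of each (Categ1, Categ2) group and drop the raw columns.
DF2 <- DF[!duplicated(DF[c("Categ1", "Categ2")]),
          c("Categ1", "Categ2", "GroupTotSamples", "GroupAvgFreq")]
DF2
```

Because each ave() call takes its own FUN, a different aggregation can be applied to each column, and the surviving row names (1, 2, 3, 6) are those of the first row in each group.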
Another solution, which returns the sums by group as a matrix or a data frame and is both short and fast:
rowsum(x$Frequency, x$Category)
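As a small sketch of the two return types mentioned above, using the question's example data: a numeric vector gives a one-column matrix whose row names are the groups, while a data frame of numeric columns gives a data frame.

```r
x <- data.frame(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# Vector input: a 3 x 1 matrix with the groups as row names.
m <- rowsum(x$Frequency, x$Category)

# Data frame input: a data frame with the groups as row names.
d <- rowsum(x["Frequency"], x$Category)

m
d
```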
Since dplyr 1.0.0, the across() function can be used:
df %>%
group_by(Category) %>%
summarise(across(Frequency, sum))
Category Frequency
<chr> <int>
1 First 30
2 Second 5
3 Third 34
If you are interested in multiple variables:
df %>%
group_by(Category) %>%
summarise(across(c(Frequency, Frequency2), sum))
Category Frequency Frequency2
<chr> <int> <int>
1 First 30 55
2 Second 5 29
3 Third 34 190
And selecting variables with select helpers:
df %>%
group_by(Category) %>%
summarise(across(starts_with("Freq"), sum))
Category Frequency Frequency2 Frequency3
<chr> <int> <int> <dbl>
1 First 30 55 110
2 Second 5 29 58
3 Third 34 190 380
Sample data:
df <- read.table(text = "Category Frequency Frequency2 Frequency3
1 First 10 10 20
2 First 15 30 60
3 First 5 15 30
4 Second 2 8 16
5 Third 14 70 140
6 Third 20 120 240
7 Second 3 21 42",
header = TRUE,
stringsAsFactors = FALSE)
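Comments:
+1, but 0.296 vs 0.059 isn't particularly impressive. The data needs to be much bigger than 300k rows, with more than 3 groups, for data.table to shine. For example, we will try to support more than 2 billion rows, since some data.table users have 250GB of RAM and GNU R now supports length > 2^31.
True. It turns out I don't have that much RAM, and I was simply trying to provide some evidence of data.table's superior performance. I'm sure the difference would be even larger with more data.
I had 7 million observations; dplyr took 0.3 seconds and aggregate() took 22 seconds to complete the operation. I was going to post it on this topic, but you beat me to it!
There is an even shorter way to write data[, sum(Frequency), by = Category]. You could use .N in place of the sum() function: data[, .N, by = Category].
Using .N is equivalent to sum(Frequency) only if all values in the Frequency column are equal to 1, because .N counts the number of rows in each aggregated set (.SD). That is not the case here.
How fast is it compared with the data.table and aggregate alternatives presented in the other answers?
@asieira, which is fastest, and how big the difference is (or whether it is noticeable at all), will always depend on your data size. Typically, for large datasets (for example some GB), data.table will most likely be fastest. On smaller data, data.table and dplyr are often close, which also depends on the number of groups. However, both data.table and dplyr will be a lot faster than the base functions (quite possibly 100-1000 times faster for some operations).
What does "funs" refer to in the second example?
@lauren.marietta, you can specify the function(s) you want to apply as a summary inside the funs() argument.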
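The point about .N raised in the comments can be checked directly. This is a minimal sketch using the question's example data; the result column names n_rows and total are illustrative choices.

```r
library(data.table)

# Same example data as in the answers above.
data <- data.table(
  Category  = c("First", "First", "First", "Second", "Third", "Third", "Second"),
  Frequency = c(10, 15, 5, 2, 14, 20, 3)
)

# .N counts the rows in each group; sum(Frequency) adds up the values.
res <- data[, .(n_rows = .N, total = sum(Frequency)), by = Category]
res
```

Here n_rows is 3, 2, 2 while total is 30, 5, 34, so the two agree only when every Frequency value is 1.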
library(tidyverse)
x <- data.frame(Category= c('First', 'First', 'First', 'Second', 'Third', 'Third', 'Second'),
Frequency = c(10, 15, 5, 2, 14, 20, 3))
count(x, Category, wt = Frequency)