R: the correct way to compute a mean when aggregating long-format data
For a simple data frame:
client_id<-c("111","111","111","112","113","113","114")
transactions<-c(1,2,2,2,3,17,100)
transactions_sum<-c(5,5,5,2,20,20,100) ##precalculated sums of transaction counts for each client_id
segment<-c("low","low","low","low","low","low","high")
test<-data.frame(client_id,transactions,transactions_sum,segment)
  client_id transactions transactions_sum segment
1       111            1                5     low
2       111            2                5     low
3       111            2                5     low
4       112            2                2     low
5       113            3               20     low
6       113           17               20     low
7       114          100              100    high
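Because repeated client IDs have to be taken into account when computing the mean, we should sum the individual transaction counts within each segment (1+2+2+2+3+17 for the low segment) and divide by the number of unique client IDs (3 for the low segment), which gives 27/3 = 9 for the low segment. The precalculated sum for each client_id gives the same result: (5+2+20)/3 = 9.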
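The arithmetic above can be checked in base R. This is only a sketch: the names low, mean_by_rows, per_client and mean_by_clients are introduced here for illustration and are not part of the original question.

```r
# Rebuild the example data so the snippet runs on its own
test <- data.frame(
  client_id    = c("111", "111", "111", "112", "113", "113", "114"),
  transactions = c(1, 2, 2, 2, 3, 17, 100),
  segment      = c("low", "low", "low", "low", "low", "low", "high")
)

# Rows belonging to the low segment (illustrative helper name)
low <- test[test$segment == "low", ]

# Sum of all transactions divided by the number of unique clients: 27 / 3
mean_by_rows <- sum(low$transactions) / length(unique(low$client_id))

# Same value from one total per client: (5 + 2 + 20) / 3
per_client <- tapply(low$transactions, low$client_id, sum)
mean_by_clients <- mean(per_client)
```

Both mean_by_rows and mean_by_clients evaluate to 9 for the low segment.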
However, when I try to run "dcast" or "aggregate" on this data I get the wrong numbers, because they evidently treat every row as a unique observation:
dcast(test, segment ~ ., mean, value.var="transactions")
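gives 4.5 for the low segment. This effectively tells us that it sums the transaction counts for each segment (1+2+2+2+3+17 for low) and then divides by the number of observations in each segment (6 for low) rather than by the number of unique client IDs.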
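The same row-based mean comes out of aggregate, the other function mentioned in the question. A minimal base R sketch (the name row_means is only illustrative):

```r
# Rebuild the example data so the snippet runs on its own
test <- data.frame(
  client_id    = c("111", "111", "111", "112", "113", "113", "114"),
  transactions = c(1, 2, 2, 2, 3, 17, 100),
  segment      = c("low", "low", "low", "low", "low", "low", "high")
)

# mean() is applied to every row of a segment, so the low segment
# is divided by its 6 rows instead of its 3 unique clients
row_means <- aggregate(transactions ~ segment, data = test, FUN = mean)

# Pull out the value for the low segment: 27 / 6
low_row_mean <- row_means$transactions[row_means$segment == "low"]
```

Here low_row_mean is 27/6 = 4.5, not the desired 9.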
What is the correct way to compute the mean in this case?
We can use data.table:
library(data.table)
setDT(test)[, .(transactions_mean = sum(transactions)/uniqueN(client_id)), by = segment]
# segment transactions_mean
#1: low 9
#2: high 100
You can use the following option:
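# Transaction total divided by the number of unique clients, per segment
meanLow <- with(test[test$segment == "low", ], sum(transactions) / length(unique(client_id)))
meanHigh <- with(test[test$segment == "high", ], sum(transactions) / length(unique(client_id)))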
You can also use dplyr:
library(dplyr)
test_2 <- test %>%
  group_by(segment) %>%
  summarise(meanTransactions = sum(transactions) / n_distinct(client_id))
test_2
# A tibble: 2 × 2
  segment meanTransactions
  <chr>              <dbl>
1 high                 100
2 low                    9
Thanks, this works. Previously I tried dcast(test, segment ~ ., fun.aggregate = function(x) (sum(x)/length(unique(client_id))), value.var = "transactions") without success. It seems I needed to use another library.
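The dcast attempt most likely fails because fun.aggregate receives only the transactions values of each group as x, so client_id inside the function does not refer to that group's IDs. For what it's worth, the same result seems reachable in base R without another library by aggregating twice. This is a sketch, and per_client and segment_means are illustrative names:

```r
# Rebuild the example data so the snippet runs on its own
test <- data.frame(
  client_id    = c("111", "111", "111", "112", "113", "113", "114"),
  transactions = c(1, 2, 2, 2, 3, 17, 100),
  segment      = c("low", "low", "low", "low", "low", "low", "high")
)

# Step 1: one transaction total per client, keeping the segment column
per_client <- aggregate(transactions ~ client_id + segment, data = test, FUN = sum)

# Step 2: average the per-client totals within each segment
segment_means <- aggregate(transactions ~ segment, data = per_client, FUN = mean)
```

With this data, segment_means holds 9 for the low segment and 100 for the high segment, matching the data.table and dplyr answers.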