dplyr: summarise across multiple groups in long format
I know there are plenty of questions that sound similar to some extent, but I have not been able to find an answer to this exact problem. Suppose we have a toy data set:
library(tidyverse)
df <- tibble(
Gender = c("m", "f", "f", "m", "m",
"f", "f", "f", "m", "f"),
IQ = rnorm(10, 100, 15),
Other = runif(10),
Test = rnorm(10),
group2 = c("A", "A", "A", "A", "A",
"B", "B", "B", "B", "B")
)
I would like to get:
Variable Gender mean min max
<chr> <chr> <dbl> <dbl> <dbl>
1 IQ f 99.2 81.9 121.
2 IQ m 89.0 62.5 106.
3 Other f 0.301 0.187 0.479
4 Other m 0.395 0.0483 0.757
5 Test f -0.0770 -1.18 0.545
6 Test m 0.163 -0.632 0.828
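I tried the following, but it returns wide format (as does data.table):
df %>%
  group_by(Gender) %>%
  summarise_if(is.numeric, list(mean = mean,
                                min = min,
                                max = max))
  Gender IQ_mean Other_mean Test_mean IQ_min Other_min Test_min IQ_max
  <chr>    <dbl>      <dbl>     <dbl>  <dbl>     <dbl>    <dbl>  <dbl>
1 f         99.2      0.301   -0.0770   81.9    0.187    -1.18    121.
2 m         89.0      0.395    0.163    62.5    0.0483   -0.632   106.
# … with 2 more variables: Other_max <dbl>, Test_max <dbl>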
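This is pretty useless once you have more than 10 variables.
What am I missing?

You could first convert df to long format by gathering IQ, Other and Test into a single variable column, and then compute the summary statistics for each group (Gender-group2-variable):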
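library(tidyverse)
set.seed(1)

## data: df as defined in the question, created after set.seed(1)

df %>%
  gather(key = "variable", value = "value", -c(Gender, group2)) %>%
  group_by(Gender, group2, variable) %>%
  summarize_at("value", list(mean = mean, min = min, max = max)) %>%
  ungroup()
#> # A tibble: 12 x 6
#>    Gender group2 variable   mean     min     max
#>    <chr>  <chr>  <chr>     <dbl>   <dbl>   <dbl>
#>  1 f      A      IQ       95.1   87.5    103.
#>  2 f      A      Other     0.432  0.212    0.652
#>  3 f      A      Test      0.464 -0.0162   0.944
#>  4 f      B      IQ      100.    87.7    111.
#>  5 f      B      Other     0.281  0.0134   0.386
#>  6 f      B      Test      0.599  0.0746   0.919
#>  7 m      A      IQ      106.    90.6    124.
#>  8 m      A      Other     0.442  0.126    0.935
#>  9 m      A      Test      0.457 -0.0449   0.821
#> 10 m      B      IQ      109.   109.     109.
#> 11 m      B      Other     0.870  0.870    0.870
#> 12 m      B      Test     -1.99  -1.99    -1.99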
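As a side note, gather() and summarize_at() still work but are superseded in current tidyr/dplyr. Here is a minimal sketch of the same reshape-then-summarise idea with pivot_longer(), assuming tidyr >= 1.0.0 and dplyr >= 1.0.0; the object name long_summary is only illustrative and not part of the original answer.

```r
library(dplyr)
library(tidyr)

set.seed(1)
# Toy data with the same layout as in the question
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

long_summary <- df %>%
  # stack the three measurement columns into variable/value pairs
  pivot_longer(cols = c(IQ, Other, Test),
               names_to = "variable", values_to = "value") %>%
  group_by(Gender, group2, variable) %>%
  # one row per Gender x group2 x variable; .groups = "drop" ungroups the result
  summarise(mean = mean(value), min = min(value), max = max(value),
            .groups = "drop")

long_summary
```

With this data there are four Gender/group2 combinations and three variables, so the result has 12 rows.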
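You can do it by adding gather, separate and spread to your own code:
df %>%
  group_by(Gender, group2) %>%
  summarise_if(is.numeric, list(mean = mean,
                                min = min,
                                max = max)) %>%
  gather(vars, vals, -Gender, -group2) %>%
  separate(vars, c("Variable", "stat")) %>%
  spread(stat, vals)

#### OUTPUT ####
# A tibble: 12 x 6
# Groups:   Gender [2]
   Gender group2 Variable    max   mean      min
 1 f      A      IQ       110.   103.    95
 2 f      A      Other      0.934  0.469   0.00439
 3 f      A      Test       1.39   0.472  -0.446
 4 f      B      IQ       121.    92.0    75.6
 5 f      B      Other      0.730  0.461   0.261
 6 f      B      Test       0.589  0.276  -0.524
 7 m      A      IQ       112.   104.     94.3
 8 m      A      Other      0.827  0.613   0.308
 9 m      A      Test       0.724  0.136  -0.264
10 m      B      IQ       115.   115.    115.
11 m      B      Other      0.970  0.970   0.970
12 m      B      Test      -1.05  -1.05   -1.05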
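The summarise-first approach above can also be sketched with across() and a single pivot_longer() call, which replaces the gather/separate/spread chain. This is a sketch under the assumption of dplyr >= 1.0.0 and tidyr >= 1.0.0; the names wide and long are illustrative only.

```r
library(dplyr)
library(tidyr)

set.seed(1)
# Toy data with the same layout as in the question
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

# Step 1: wide summary; across() names the columns IQ_mean, IQ_min, ...
wide <- df %>%
  group_by(Gender, group2) %>%
  summarise(across(where(is.numeric),
                   list(mean = mean, min = min, max = max)),
            .groups = "drop")

# Step 2: split each name at "_"; the first part goes into the Variable
# column, the second part (".value") becomes the output column names
long <- wide %>%
  pivot_longer(cols = -c(Gender, group2),
               names_to = c("Variable", ".value"),
               names_sep = "_")

long
```

Like separate(), the split relies on the separator: if the original column names themselves contain an underscore, names_sep = "_" would split in the wrong place, and a different separator (set through the .names argument of across()) would be needed.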
Here is a data.table approach:
library( data.table )
melt( setDT(df),
id.vars = c("Gender", "group2") )[, .(max = max(value, na.rm = TRUE),
min = min(value, na.rm = TRUE),
mean = mean(value, na.rm = TRUE)),
by = .(Gender, group2, variable )][]
# Gender group2 variable max min mean
# 1: m A IQ 120.739562935 83.46037366 96.99412720
# 2: f A IQ 98.657598754 98.43677811 98.54718843
# 3: f B IQ 111.973534436 71.38605822 94.04719457
# 4: m B IQ 102.913093964 102.91309396 102.91309396
# 5: m A Other 0.861929066 0.51651983 0.66098944
# 6: f A Other 0.752484881 0.07648229 0.41448359
# 7: f B Other 0.463524836 0.18308752 0.33301693
# 8: m B Other 0.099740011 0.09974001 0.09974001
# 9: m A Test 1.159379020 -0.83569116 0.04268551
# 10: f A Test -0.009017293 -0.77245300 -0.39073515
# 11: f B Test 1.591132150 -0.99248570 -0.24997246
# 12: m B Test 1.654489766 1.65448977 1.65448977
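One thing to be aware of with the snippet above: setDT() converts df to a data.table by reference, so df itself is modified. If the input should stay untouched, a minimal sketch would use as.data.table(), which returns a copy. The data frame below is a stand-in for the toy data, and the names DT and res are illustrative.

```r
library(data.table)

set.seed(1)
# Stand-in for the toy data from the question, as a plain data.frame
df <- data.frame(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

# as.data.table() copies, so df remains an ordinary data.frame
DT <- as.data.table(df)

# melt to long format, then summarise per Gender x group2 x variable
res <- melt(DT, id.vars = c("Gender", "group2"))[
  , .(max  = max(value, na.rm = TRUE),
      min  = min(value, na.rm = TRUE),
      mean = mean(value, na.rm = TRUE)),
  by = .(Gender, group2, variable)
]

res
```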
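Benchmarks
# Unit: milliseconds
#        expr       min        lq      mean    median        uq       max neval
#  data.table  1.498788  1.819936  1.997320  1.980358  2.218809  2.413124    10
#  tidyverse1 11.263956 11.887270 12.421442 11.963340 12.484075 15.401816    10
#  tidyverse2  4.952477  5.185053  6.303103  6.001478  6.902558  9.663341    10
microbenchmark::microbenchmark(
  data.table = {
    # copy first, because setDT() modifies its argument by reference
    DT <- copy(df)
    melt( setDT(DT),
          id.vars = c("Gender", "group2") )[, .(max = max(value, na.rm = TRUE),
                                                min = min(value, na.rm = TRUE),
                                                mean = mean(value, na.rm = TRUE)),
                                            by = .(Gender, group2, variable )][]
  },
  tidyverse1 = {
    # this copy is not used below; it is kept as in the original benchmark
    DT <- copy(df)
    df %>%
      group_by(Gender, group2) %>%
      summarise_if(is.numeric, list(mean = mean,
                                    min = min,
                                    max = max)) %>%
      gather(vars, vals, -Gender, -group2) %>%
      separate(vars, c("Variable", "stat")) %>%
      spread(stat, vals)
  },
  tidyverse2 = {
    df %>%
      gather(key = "variable", value = "value", -c(Gender, group2)) %>%
      group_by(Gender, group2, variable) %>%
      summarize_at("value", list(mean = mean, min = min, max = max)) %>%
      ungroup()
  },
  times = 10
)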
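Do you know how to add an arbitrary number of groups inside your own function using this approach? ... doesn't seem to work. Edit: it is probably best to just ask a new question, thanks!
I'm not quite sure which function's dots you are referring to; could you clarify?
See here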
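For what it's worth, the question in the comments about an arbitrary number of grouping variables could be approached by forwarding the function's dots to group_by(). This is only a sketch: summarise_long is a hypothetical helper name, it assumes dplyr >= 1.0.0 and tidyr >= 1.0.0, and it assumes that every numeric column is a measurement to be summarised.

```r
library(dplyr)
library(tidyr)

# Hypothetical helper: `...` takes any number of bare grouping columns
summarise_long <- function(data, ...) {
  data %>%
    # stack all numeric columns; the non-numeric ones stay as identifiers
    pivot_longer(cols = where(is.numeric),
                 names_to = "variable", values_to = "value") %>%
    # forward the dots, then add the variable column as the last group
    group_by(..., variable) %>%
    summarise(mean = mean(value), min = min(value), max = max(value),
              .groups = "drop")
}

set.seed(1)
# Toy data with the same layout as in the question
df <- tibble(
  Gender = c("m", "f", "f", "m", "m", "f", "f", "f", "m", "f"),
  IQ     = rnorm(10, 100, 15),
  Other  = runif(10),
  Test   = rnorm(10),
  group2 = rep(c("A", "B"), each = 5)
)

by_gender <- summarise_long(df, Gender)          # one grouping variable
by_both   <- summarise_long(df, Gender, group2)  # two grouping variables
```

Calling it with no grouping columns at all, summarise_long(df), simply groups by variable.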