Find the minimum of another column given the minimum of a column (dplyr)
Suppose I have the following dataset in R:
> td
  Type Rep Value1 Value2
1    A   1      7      1
2    A   2      5      4
3    A   3      5      3
4    A   4      8      2
5    B   1      5     10
6    B   2      6      1
7    B   3      7      1
8    C   1      8     13
9    C   2      8     13
> td <- structure(list(Type = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L,
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Rep = c(1L,
2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L), Value1 = c(7L, 5L, 5L, 8L, 5L,
6L, 7L, 8L, 8L), Value2 = c(1L, 4L, 3L, 2L, 10L, 1L, 1L, 13L,
13L)), .Names = c("Type", "Rep", "Value1", "Value2"), class = "data.frame",
row.names = c(NA, -9L))
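In the table I am after, the data are summarized by Type. Column MinValue1 is the minimum of Value1 for a given Type, and column MinValue2 is the minimum of Value2 among the rows where Value1 is at its minimum. The Mean* columns are plain means over all observations of that Type.
One way would be to write a loop that iterates over each Type and does the computation. However, I am looking for a better/simpler/prettier way to do this.
I have played with the tidyverse tools: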
> library(tidyverse)
> td %>%
    group_by(Type) %>%
    summarise(MinValue1 = min(Value1),
              MeanValue1 = mean(Value1),
              MeanValue2 = mean(Value2))
# A tibble: 3 × 4
    Type MinValue1 MeanValue1 MeanValue2
  <fctr>     <int>      <dbl>      <dbl>
1      A         5       6.25        2.5
2      B         5       6.00        4.0
3      C         8       8.00       13.0
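Note that the column MinValue2 is missing here. Note also that summarise(..., MinValue2 = min(Value2), ...) does not work, because it takes the minimum over all observations of a Type rather than only over the rows where Value1 is minimal.
We could use slice and then merge the results: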
> td %>% group_by(Type) %>% slice(which.min(Value1))
Source: local data frame [3 x 4]
Groups: Type [3]
    Type   Rep Value1 Value2
  <fctr> <int>  <int>  <int>
1      A     2      5      4
2      B     1      5     10
3      C     1      8     13
But note that slice does not help us here: for Type A, where the minimum of Value1 is 5, the row returned by slice should have Value2 == 3, not Value2 == 4.
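So, do you have an elegant way to achieve the result I am after? Thanks.
After grouping by 'Type', create an additional grouping variable that holds the minimum of 'Value2' among the rows where 'Value1' is at its minimum, use summarise_each to get the min and mean of the selected columns 'Value1' and 'Value2', and remove 'Value2_min' with select.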
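The steps just described can be sketched as follows. This is not the answer's original code: it is a rewrite under the assumption of dplyr 1.0.0 or later, using an explicit mutate() for the extra grouping key and across() in place of summarise_each(). The result name res is hypothetical.

```r
library(dplyr)

# Example data from the question
td <- data.frame(
  Type   = factor(c("A", "A", "A", "A", "B", "B", "B", "C", "C")),
  Rep    = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L),
  Value1 = c(7L, 5L, 5L, 8L, 5L, 6L, 7L, 8L, 8L),
  Value2 = c(1L, 4L, 3L, 2L, 10L, 1L, 1L, 13L, 13L)
)

res <- td %>%
  group_by(Type) %>%
  # Step 1: per Type, minimum of Value2 among rows where Value1 is minimal
  mutate(MinValue2 = min(Value2[Value1 == min(Value1)])) %>%
  # Step 2: keep that value by making it a second grouping key
  group_by(Type, MinValue2) %>%
  # Step 3: min and mean of both value columns (named Value1_min, ...)
  summarise(across(c(Value1, Value2), list(min = min, mean = mean)),
            .groups = "drop") %>%
  # Step 4: drop the unconditional minimum of Value2
  select(-Value2_min)

print(res)
```

Because across() takes a column selection, the same pipeline extends to many value columns without naming each summary by hand.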
One way is to use the fact that the order function can break ties using a second vector:
get_min_at_min <- function(vec1, vec2) {
  return(vec2[order(vec1, vec2)[1]])
}
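To make the tie-breaking concrete, here is a small sketch on the Type A rows of the example data. The variable names value1 and value2 are only for illustration, and the helper is repeated so the snippet runs on its own.

```r
# Same helper as above, repeated so this snippet is self-contained
get_min_at_min <- function(vec1, vec2) {
  vec2[order(vec1, vec2)[1]]
}

# Type A rows from the example data
value1 <- c(7L, 5L, 5L, 8L)
value2 <- c(1L, 4L, 3L, 2L)

which.min(value1)               # 2: first position of the minimum only
order(value1, value2)           # 3 2 1 4: ties in value1 sorted by value2
get_min_at_min(value1, value2)  # 3: Value2 at the tie-broken minimum
```

This is also why slice(which.min(Value1)) returned Value2 == 4 in the question: which.min() stops at the first tied row.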
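Or simply use the fact that a variable computed earlier in a dplyr call (here MinValue1) can be used in later expressions of the same call: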
td %>%
  group_by(Type) %>%
  summarise(MinValue1 = min(Value1),
            MinValue2 = min(Value2[Value1 == MinValue1]),
            MeanValue1 = mean(Value1),
            MeanValue2 = mean(Value2))
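As a quick check, both variants can be run side by side on the example data from the question. This is only a sketch, and the result names res1 and res2 are hypothetical.

```r
library(dplyr)

# Example data from the question
td <- data.frame(
  Type   = factor(c("A", "A", "A", "A", "B", "B", "B", "C", "C")),
  Rep    = c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 1L, 2L),
  Value1 = c(7L, 5L, 5L, 8L, 5L, 6L, 7L, 8L, 8L),
  Value2 = c(1L, 4L, 3L, 2L, 10L, 1L, 1L, 13L, 13L)
)

get_min_at_min <- function(vec1, vec2) {
  vec2[order(vec1, vec2)[1]]
}

# Variant 1: tie-breaking with order()
res1 <- td %>%
  group_by(Type) %>%
  summarise(MinValue1 = min(Value1),
            MinValue2 = get_min_at_min(Value1, Value2))

# Variant 2: reuse MinValue1, defined earlier in the same summarise() call
res2 <- td %>%
  group_by(Type) %>%
  summarise(MinValue1 = min(Value1),
            MinValue2 = min(Value2[Value1 == MinValue1]))

print(res2)
all(res1$MinValue2 == res2$MinValue2)
```

One design point to keep in mind: the summary column is named MinValue1 rather than Value1. Within a single summarise() call a new column replaces an existing one of the same name for all later expressions, so reusing the name Value1 would make a subsequent mean(Value1) operate on the minimum.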
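Thank you very much, @evgeniC and @akrun. Your help was valuable. For my purposes/dataset, both solutions work very well. So, to enrich the discussion, I ran some experiments to test the speed of the suggestions, using the following script (commenting/uncommenting the relevant solution for each experiment):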
library(tidyverse)
# Command-line arguments arrive as strings: args[1] = seed, args[2] = number of rows
args <- commandArgs(TRUE)
set.seed(as.integer(args[1]))
n <- as.integer(args[2])
# Random test data: 26 groups, integer values in 1..100
td <- data.frame(Type = sample(LETTERS, n, replace = TRUE),
                 Value1 = sample(1:100, n, replace = TRUE),
                 Value2 = sample(1:100, n, replace = TRUE))
ptm <- proc.time()
### Solution 1 ###
#get_min_at_min <- function(vec1, vec2) {
#  return(vec2[order(vec1, vec2)[1]])
#}
#tmp <- td %>%
#  group_by(Type) %>%
#  summarise(MinValue1 = min(Value1),
#            MinValue2 = get_min_at_min(Value1, Value2),
#            MeanValue1 = mean(Value1),
#            MeanValue2 = mean(Value2))
### Solution 2 ###
tmp <- td %>%
  group_by(Type) %>%
  summarise(MinValue1 = min(Value1),
            MinValue2 = min(Value2[Value1 == MinValue1]),
            MeanValue1 = mean(Value1),
            MeanValue2 = mean(Value2))
### Solution 3 ###
# (written for an older dplyr: summarise_each(), funs() and add = are deprecated in current releases)
#tmp <- td %>%
#  group_by(Type) %>%
#  group_by(MinValue2 = min(Value2[Value1 == min(Value1)]), add = TRUE) %>%
#  summarise_each(funs(min, mean), Value1:Value2) %>%
#  select(-Value2_min)
# Time spent in the active solution
print(proc.time() - ptm)
Running the script and aggregating the timings per solution, we got the following results:
       Alg User_mean System_mean Elapsed_mean    User_sd   System_sd Elapsed_sd
1    akrun 1.3643333  0.13766667     1.510333 0.01069268 0.005033223 0.02050203
2 evgeniC1 0.8706667  0.07466667     0.951000 0.03323151 0.003055050 0.04073082
3 evgeniC2 0.8600000  0.09300000     0.958000 0.05546170 0.005196152 0.06331666
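So I am inclined to use @evgeniC's Solution 2, since it is the most elegant/simple and is as fast as Solution 1. @akrun proposed a nice solution, but it is somewhat more complex and slower. In any case, that setup is useful in other situations too.
Thank you very much. The last option is what I was looking for.
@akrun's answer is better when the input data has many columns: it saves typing. I would also recommend the microbenchmark package for testing performance; see, for example, …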
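A minimal sketch of what such a microbenchmark comparison might look like is given below. The seed, the data size (far smaller than the 10,000,000 rows used above), the wrapper names sol1 and sol2, and times = 10 are all arbitrary choices for illustration, so the timings will not be comparable to the table above.

```r
library(dplyr)
library(microbenchmark)

set.seed(1)   # arbitrary seed, not the one used in the post
n <- 1e5      # deliberately small so the sketch runs quickly

td <- data.frame(Type = sample(LETTERS, n, replace = TRUE),
                 Value1 = sample(1:100, n, replace = TRUE),
                 Value2 = sample(1:100, n, replace = TRUE))

get_min_at_min <- function(vec1, vec2) {
  vec2[order(vec1, vec2)[1]]
}

# Solution 1: tie-breaking with order()
sol1 <- function(d) {
  d %>%
    group_by(Type) %>%
    summarise(MinValue1 = min(Value1),
              MinValue2 = get_min_at_min(Value1, Value2),
              MeanValue1 = mean(Value1),
              MeanValue2 = mean(Value2))
}

# Solution 2: reuse MinValue1 inside the same summarise() call
sol2 <- function(d) {
  d %>%
    group_by(Type) %>%
    summarise(MinValue1 = min(Value1),
              MinValue2 = min(Value2[Value1 == MinValue1]),
              MeanValue1 = mean(Value1),
              MeanValue2 = mean(Value2))
}

# Each expression is evaluated `times` times
bench <- microbenchmark(sol1 = sol1(td), sol2 = sol2(td), times = 10)
print(bench)
```

Compared with a single proc.time() difference, this repeats each expression several times within one R session and reports summary statistics of the timings.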
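The benchmark script was invoked, and the collected timings were aggregated (here td holds the timings, not the benchmark data), as follows:
$ Rscript test.R 270001 10000000
> td %>% group_by(Alg) %>% summarise_each(funs(mean, sd), User:Elapsed)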