How to use dplyr's summarize and which() to look up min/max values
I have the following data:
Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,12,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
data <- data.frame(Name, Age, Group)
This works well: a grouped summarize that looks up each name with which() gives:
  Group minAge minAgeName maxAge maxAgeName
1     A     22        Sam     22        Sam
2     B     12      Sarah     58      James
3     C     17     Andrew     82      Sally
4     D     12     Mairin     67        Ray
However, I run into a problem when a group has more than one minimum or maximum value:
Name <- c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew", "John", "Mairin", "Kate", "Sasha", "Ray", "Ed")
Age <- c(22,31,31,35,58,82,17,34,12,24,44,67,43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
data <- data.frame(Name, Age, Group)
> data %>% group_by(Group) %>%
+ summarize(minAge = min(Age), minAgeName = Name[which(Age == min(Age))],
+ maxAge = max(Age), maxAgeName = Name[which(Age == max(Age))])
Error: expecting a single value
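The error comes from which(Age == min(Age)) returning more than one index inside a group, while this summarize call expects a single value per group (newer dplyr versions may report or handle this differently). A minimal base-R sketch, not part of the original question, that counts the ties per group in the second dataset:

```r
# Second example dataset from the question
Age <- c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43)
Group <- c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")

# For each group, count how many rows share the minimum age
min_ties <- tapply(Age, Group, function(a) length(which(a == min(a))))
print(min_ties)
```

Group B has two rows at its minimum age (31), which is what breaks the one-value-per-group lookup.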
You can use which.min and which.max to get the first value:
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = Name[which.min(Age)],
maxAge = max(Age), maxAgeName = Name[which.max(Age)])
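To illustrate the difference on a single group, here is a small base-R sketch using the ages of group B from the second dataset (the variable names are only for illustration):

```r
# Ages of group B in the second dataset: Sarah, Jim, Fred, James
ages_b <- c(31, 31, 35, 58)

# which.min() returns only the index of the first minimum
first_min <- which.min(ages_b)

# which() with a comparison returns the index of every minimum
all_min <- which(ages_b == min(ages_b))

print(first_min)
print(all_min)
```

Because which.min() always yields a single index, the lookup Name[which.min(Age)] is safe inside summarize even when there are ties.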
To get all values, use e.g. paste with an appropriate collapse argument:
data %>% group_by(Group) %>%
summarize(minAge = min(Age), minAgeName = paste(Name[which(Age == min(Age))], collapse = ", "),
maxAge = max(Age), maxAgeName = paste(Name[which(Age == max(Age))], collapse = ", "))
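As a self-contained variant of the same idea, the sketch below uses toString(), which is base R shorthand for paste(x, collapse = ", "). It assumes dplyr is installed; the object name res is only for illustration.

```r
library(dplyr)

# Second example dataset from the question
data <- data.frame(
  Name = c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
           "John", "Mairin", "Kate", "Sasha", "Ray", "Ed"),
  Age = c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43),
  Group = c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D"),
  stringsAsFactors = FALSE
)

# One row per group; tied names are collapsed into a single string
res <- data %>%
  group_by(Group) %>%
  summarize(
    minAge = min(Age),
    minAgeName = toString(Name[Age == min(Age)]),
    maxAge = max(Age),
    maxAgeName = toString(Name[Age == max(Age)])
  )
print(res)
```

The collapsed string keeps the result at one row per group, at the cost of having to split the names again if they are needed individually.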
Actually, I would recommend keeping your data in a "long" format. Here is how I would approach this:
library(dplyr)
Keeping all values when there are ties:
data %>%
group_by(Group) %>%
arrange(Age) %>% ## optional
filter(Age %in% range(Age))
# Source: local data frame [8 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 Jim 31 B
# 4 James 58 B
# 5 Andrew 17 C
# 6 Sally 82 C
# 7 Mairin 12 D
# 8 Ray 67 D
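The filter above relies on range() returning the pair c(min, max) and on %in% marking every element equal to either one. A small base-R sketch on the ages of group B (variable names are only for illustration):

```r
# Ages of group B in the second dataset: Sarah, Jim, Fred, James
ages_b <- c(31, 31, 35, 58)

# range() returns the minimum and the maximum as a length-2 vector
rng <- range(ages_b)

# %in% flags every row whose age equals the minimum or the maximum,
# so both tied minimum rows are kept
keep <- ages_b %in% rng

print(rng)
print(keep)
```

For a single-row group such as A, the minimum and maximum coincide, so the filter keeps that one row once.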
Keeping only one value when there are ties:
data %>%
group_by(Group) %>%
arrange(Age) %>%
slice(if (length(Age) == 1) 1 else c(1, n())) ## maybe overkill?
# Source: local data frame [7 x 3]
# Groups: Group
#
# Name Age Group
# 1 Sam 22 A
# 2 Sarah 31 B
# 3 James 58 B
# 4 Andrew 17 C
# 5 Sally 82 C
# 6 Mairin 12 D
# 7 Ray 67 D
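In newer dplyr releases, the choice between keeping all ties and keeping one row can be made explicit with slice_min() and slice_max(). This is a sketch under the assumption that dplyr 1.0.0 or later is installed; it is an alternative to the code above, not the answer's own method, and the object names are only for illustration.

```r
library(dplyr)

# Second example dataset from the question
data <- data.frame(
  Name = c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
           "John", "Mairin", "Kate", "Sasha", "Ray", "Ed"),
  Age = c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43),
  Group = c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D"),
  stringsAsFactors = FALSE
)

# Youngest person per group, exactly one row even when ages tie
youngest_one <- data %>%
  group_by(Group) %>%
  slice_min(Age, n = 1, with_ties = FALSE) %>%
  ungroup()

# Youngest per group, keeping every tied row
youngest_all <- data %>%
  group_by(Group) %>%
  slice_min(Age, n = 1, with_ties = TRUE) %>%
  ungroup()

# Oldest person per group, one row each
oldest_one <- data %>%
  group_by(Group) %>%
  slice_max(Age, n = 1, with_ties = FALSE) %>%
  ungroup()
```

The with_ties argument states the tie-handling decision directly instead of leaving it implicit in the row order.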
If you really do want a "wide" dataset, the basic idea is to gather and spread the data using "tidyr", though it is not clear what wide form you would want for the ties.
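A minimal sketch of the long-to-wide idea, assuming dplyr 1.0.0 or later and tidyr 1.0.0 or later, where pivot_wider() supersedes spread(). It is an illustration rather than the answerer's own code, it keeps only one row per extreme (with_ties = FALSE), and the names extremes and wide are hypothetical.

```r
library(dplyr)
library(tidyr)

# Second example dataset from the question
data <- data.frame(
  Name = c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
           "John", "Mairin", "Kate", "Sasha", "Ray", "Ed"),
  Age = c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43),
  Group = c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D"),
  stringsAsFactors = FALSE
)

# Long format: one "min" row and one "max" row per group
extremes <- bind_rows(
  data %>% group_by(Group) %>%
    slice_min(Age, n = 1, with_ties = FALSE) %>%
    mutate(minmax = "min"),
  data %>% group_by(Group) %>%
    slice_max(Age, n = 1, with_ties = FALSE) %>%
    mutate(minmax = "max")
) %>%
  ungroup()

# Wide format: one row per group, columns Name_min, Name_max, Age_min, Age_max
wide <- extremes %>%
  pivot_wider(names_from = minmax, values_from = c(Name, Age))
print(wide)
```

Dropping the ties is what makes a one-row-per-group wide table well defined here; keeping them would need a list column or a collapsed string.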
Here are some data.table approaches, the first borrowed from @akrun:
library(data.table)
setDT(data) # convert the data frame to a data.table by reference
# show one, wide format
data[,c(min=.SD[which.min(Age)],max=.SD[which.max(Age)]),by=Group]
# Group min.Name min.Age max.Name max.Age
# 1: A Sam 22 Sam 22
# 2: B Sarah 31 James 58
# 3: C Andrew 17 Sally 82
# 4: D Mairin 12 Ray 67
# show all, long format
data[,{
mina=min(Age)
maxa=max(Age)
rbind(
data.table(minmax="min",Age=mina,Name=Name[which(Age==mina)]),
data.table(minmax="max",Age=maxa,Name=Name[which(Age==maxa)])
)},by=Group]
# Group minmax Age Name
# 1: A min 22 Sam
# 2: A max 22 Sam
# 3: B min 31 Sarah
# 4: B min 31 Jim
# 5: B max 58 James
# 6: C min 17 Andrew
# 7: C max 82 Sally
# 8: D min 12 Mairin
# 9: D max 67 Ray
I think the long format is best, since it lets you filter on minmax, but the code is convoluted and inefficient.
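One possible way to get the same long result with less per-group work is to collect row indices with .I and subset once, the idiom mentioned in the comments. This is only a sketch, assuming the data.table package is installed; the names dt, min_rows, max_rows and long are hypothetical, and no efficiency gain has been measured here.

```r
library(data.table)

# Second example dataset from the question
dt <- data.table(
  Name = c("Sam", "Sarah", "Jim", "Fred", "James", "Sally", "Andrew",
           "John", "Mairin", "Kate", "Sasha", "Ray", "Ed"),
  Age = c(22, 31, 31, 35, 58, 82, 17, 34, 12, 24, 44, 67, 43),
  Group = c("A", "B", "B", "B", "B", "C", "C", "D", "D", "D", "D", "D", "D")
)

# Global row numbers of every minimum and every maximum within each group
min_rows <- dt[, .I[Age == min(Age)], by = Group]$V1
max_rows <- dt[, .I[Age == max(Age)], by = Group]$V1

# Subset once per extreme, label the rows, and stack them in long format
long <- rbind(
  dt[min_rows, .(Group, minmax = "min", Age, Name)],
  dt[max_rows, .(Group, minmax = "max", Age, Name)]
)
setorder(long, Group)
print(long)
```

The result has one row per extreme per tied name, so it can be filtered on minmax like the long table above.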
Here are some not-so-great ways:
# show all, wide format (with a list column)
data[,{
mina=min(Age)
maxa=max(Age)
list(
minAge=mina,
maxAge=maxa,
minNames=list(Name[Age==mina]),
maxNames=list(Name[Age==maxa]))
},by=Group]
# Group minAge maxAge minNames maxNames
# 1: A 22 22 Sam Sam
# 2: B 31 58 Sarah,Jim James
# 3: C 17 82 Andrew Sally
# 4: D 12 67 Mairin Ray
# show all, wide format (with a string column)
# (just look at @shadow's answer)
which.min and which.max take the first value.
This works great for the first solution, thanks!
Using data.table: setDT(data)[, c(.SD[which.min(Age)], .SD[which.max(Age)]), Group] and change the names accordingly.
@akrun With a single .SD: setDT(data)[, .SD[c(which.min(Age), which.max(Age))], Group]. Similarly, the row-selection approach also works here: setDT(data)[data[, .I[c(which.min(Age), which.max(Age))], Group]$V1]. This is only the first half of the question, though (in case anyone is going to post an answer).
@Frank Using c makes for a good solution, but yours is in long format, isn't it?
It is a bit harder to read (or use), though, compared with the one-row-per-group layout with max and min columns in the OP.
@Frank, how would you handle the multiple cases? Pasting them together seems harder to me.
@Frank, way ahead of you :-)
@Frank, the min/max idea is already embedded in my last answer. Just stop at the mutate line. When there are multiples, you can use dense_rank and factor.
Thanks a lot! I especially like the simplicity of the filter with %in% range solution.
For the second set of code, did you use the second example data? (I get 8 rows as output.)
@akrun Good catch. I have switched back to my original, obviously inefficient code, which does produce nine rows. The only alternative I could think of is equally inefficient: putting an ifelse at the front of the code. I think dplyr wins this round, since producing the result in my second table is not hard there (as Ananda explained in the comments).
Very nice, thanks! I am not using data.table in this case, since performance is not critical, but I appreciate the post.
Note to self / potential editors: the ugly code in the second block could be cleaned up with the frank() function.
Thanks again, this is exactly what I was looking for too. I was not clear about which data format I wanted. This works well for the client.