R: Subset/group a data frame by maximum value?

r dataframe


Given a data frame like this:

  gid set  a  b
1   1   1  1  9
2   1   2 -2 -3
3   1   3  5  6
4   2   2 -4 -7
5   2   6  5 10
6   2   9  2  0

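How would I subset/group the data frame so that each unique gid keeps only its row with the max set value, along with a 1/0 flag for whether that row's a value is greater than its b value?

So here, the result would be, uh...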

1,3,0
2,9,1

This is kind of a stupidly simple thing in SQL, but I'd like to get better control over my R, so...

This is pretty straightforward with dplyr:

dat <- read.table(text="gid set  a  b
1   1  1  9
1   2 -2 -3
1   3  5  6
2   2 -4 -7
2   6  5 10
2   9  2  0", header=TRUE)

library(dplyr)

dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater=a>b) %>%
  select(gid, set, greater)

## Source: local data frame [2 x 3]
## Groups: gid
## 
##   gid set greater
## 1   1   3   FALSE
## 2   2   9    TRUE
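
For comparison, here is a minimal sketch of the same result using slice_max(), which is not part of the answer above and assumes dplyr 1.0.0 or later. The name top_rows is only illustrative. Setting with_ties = FALSE keeps exactly one row per gid, and as.integer() gives the 1/0 flag the question asks for instead of TRUE/FALSE.

```r
library(dplyr)

dat <- read.table(text="gid set  a  b
1   1  1  9
1   2 -2 -3
1   3  5  6
2   2 -4 -7
2   6  5 10
2   9  2  0", header=TRUE)

top_rows <- dat %>%
  group_by(gid) %>%
  # keep the single row with the largest set in each gid
  slice_max(set, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  # 1 if a > b, otherwise 0
  mutate(greater = as.integer(a > b)) %>%
  select(gid, set, greater)

top_rows
```

If every row tied for the maximum should be kept instead, with_ties = TRUE (the default) would do that.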

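You can do the same thing without pipes:

# the innermost call has to supply the grouped data that the pipe would otherwise pass along
ungroup(
  select(
    mutate(
      filter(group_by(dat, gid), row_number() == which.max(set)),
      greater=ifelse(a>b, 1, 0)), gid, set, greater))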

But... but... why?! :-)

Here's a data.table possibility, assuming your original data is called df:

library(data.table)

setDT(df)[, .(set = max(set), b = as.integer(a > b)[set == max(set)]), gid]
#    gid set b
# 1:   1   3 0
# 2:   2   9 1

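Note that, to account for multiple max(set) rows, I used set == max(set) as the subset, so this returns as many rows per group as there are ties for the max (if that makes any sense).

Another data.table option, provided by @thelatemail:
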
setDT(df)[, list(set = max(set), ab = (a > b)[which.max(set)] + 0), by = gid]
#    gid set ab
# 1:   1   3  0
# 2:   2   9  1

In base R, you can use ave:

indx <- with(df, ave(set, gid, FUN=max)==set)
#in cases of ties
#indx <- with(df, !!ave(set, gid, FUN=function(x) 
#                  which.max(x) ==seq_along(x)))


transform(df[indx,], greater=(a>b)+0)[,c(1:2,5)]
#   gid set greater
# 3   1   3       0
# 6   2   9       1
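
To make the tie behaviour concrete, here is a small base R sketch. The data frame tied and the names all_ties and first_only are made up for illustration: the sample data is modified so that each gid has two rows sharing the maximum set. The first index keeps every tied row, while the second keeps only the first row per group that reaches the maximum.

```r
# hypothetical data: each gid has two rows tied for the max of set
tied <- read.table(text="gid set  a  b
1   1  1  9
1   3 -2 -3
1   3  5  6
2   2 -4 -7
2   9  5 10
2   9  2  0", header=TRUE)

# set == group max: keeps all rows tied for the maximum
keep_all <- with(tied, ave(set, gid, FUN = max) == set)
all_ties <- transform(tied[keep_all, ], greater = (a > b) + 0)

# position == which.max: keeps only the first maximum in each group
keep_first <- with(tied, as.logical(ave(set, gid,
                   FUN = function(x) seq_along(x) == which.max(x))))
first_only <- transform(tied[keep_first, ], greater = (a > b) + 0)

nrow(all_ties)    # four rows, two per gid
nrow(first_only)  # two rows, one per gid
```

Which of the two is right depends on whether a tie should produce one result row per group or one per tied row.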

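This is a fairly popular "piping" idiom, originally started by the magrittr package but used widely in the "Hadleyverse" (i.e., dplyr, ggvis, tidyr, etc.). Instead of complicated nested parentheses, the idiom "pipes" data into functions for processing, similar to how D3 JavaScript function chaining works.

Thanks for the detailed info!

There might be a problem if there are ties for the max of set. filter(row_number() == which.max(set)) might be safer.

ty @RichardScriven. My oversight. Answer updated. Interestingly enough, in the netflow data I've been working with recently, this has not been an issue. Definitely an edge case to watch out for.

How might you sort the results, e.g., by set number ascending?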