R: Select first and last row from grouped data (r, dplyr)


Question:

Using dplyr, how do I select the top and bottom observations/rows of grouped data in a single statement?

Data & Example

Given a data frame

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3), 
                 stopId=c("a","b","c","a","b","c","a","b","c"), 
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

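Can I combine the two separate statements (one selecting the top observation of each group, one selecting the bottom) into a single statement that selects both?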
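The two statements themselves are not shown on this page. As a rough reconstruction, assuming they were one slice call for the first row and one for the last (the names first_stop and last_stop are made up for illustration), they might have looked like this:

```r
library(dplyr)

df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 stopId = c("a","b","c","a","b","c","a","b","c"),
                 stopSequence = c(1,2,3,3,1,4,3,1,2),
                 stringsAsFactors = FALSE)

# statement 1 (assumed): first stop of every id
first_stop <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(1) %>%
  ungroup()

# statement 2 (assumed): last stop of every id
last_stop <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(n()) %>%
  ungroup()
```

The answers that follow all aim to get both results out of one pipeline.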

There is probably a faster way:

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number()==1 | row_number()==n())
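A variant of the same idea, suggested in the comments on this page, tests membership with %in% so that the row numbers are compared only once. This is a minimal sketch, assuming dplyr is installed; the final arrange is only there to make the output order predictable and is not part of the original answer:

```r
library(dplyr)

df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 stopId = c("a","b","c","a","b","c","a","b","c"),
                 stopSequence = c(1,2,3,3,1,4,3,1,2),
                 stringsAsFactors = FALSE)

res <- df %>%
  arrange(stopSequence) %>%                 # sort first
  group_by(id) %>%                          # then number rows within each id
  filter(row_number() %in% c(1, n())) %>%   # keep first and last row per id
  ungroup() %>%
  arrange(id, stopSequence)                 # only for a stable display order

res
```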

Something like:

library(dplyr)

df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                 stopId=c("a","b","c","a","b","c","a","b","c"),
                 stopSequence=c(1,2,3,3,1,4,3,1,2))

first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  do(first_last(.)) %>%
  ungroup

## Source: local data frame [6 x 3]
## 
##   id stopId stopSequence
## 1  1      a            1
## 2  1      c            3
## 3  2      b            1
## 4  2      c            4
## 5  3      b            1
## 6  3      a            3

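With do you can perform pretty much any number of operations on the group, but @jeremycg's answer is far more appropriate for this particular task.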
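In current dplyr releases do is marked as superseded. As a hedged sketch of the same per-group pattern, the helper can instead be passed to group_modify (available in dplyr 0.8.1 and later, as far as I know). The helper first_last is the one defined above; swapping do for group_modify is my substitution, not part of the original answer:

```r
library(dplyr)

df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 stopId = c("a","b","c","a","b","c","a","b","c"),
                 stopSequence = c(1,2,3,3,1,4,3,1,2),
                 stringsAsFactors = FALSE)

# same helper as in the answer: first and last row of one group's data
first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

res <- df %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  group_modify(~ first_last(.x)) %>%   # .x is the data of a single group
  ungroup()

res
```

Note that a group with a single row comes back twice with this helper, because the first and last rows are then the same row.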

Not dplyr, but it is much more direct using data.table:

library(data.table)
setDT(df)  # converts df to a data.table by reference
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]

A more detailed explanation:

# 1) get row numbers of first/last observations from each group
#    * basically, we sort the table by id/stopSequence, then,
#      grouping by id, name the row numbers of the first/last
#      observations for each id; since this operation produces
#      a data.table
#    * .I is data.table shorthand for the row number
#    * here, to be maximally explicit, I've named the variable V1
#      as row_num to give other readers of my code a clearer
#      understanding of what operation is producing what variable
first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
idx = first_last$row_num

# 2) extract rows by number
df[idx]
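To see that the two data.table forms above pick the same rows, here is a small self-contained sketch, assuming the data.table package is installed. The object names dt, via_sd and via_i are chosen for illustration only:

```r
library(data.table)

dt <- data.table(id = c(1,1,1,2,2,2,3,3,3),
                 stopId = c("a","b","c","a","b","c","a","b","c"),
                 stopSequence = c(1,2,3,3,1,4,3,1,2))

# form 1: sort, then take rows 1 and .N of each group's subset (.SD)
via_sd <- dt[order(id, stopSequence)][, .SD[c(1L, .N)], by = id]

# form 2: collect the row numbers (.I) of the first/last rows, then index once
idx <- dt[order(id, stopSequence), .I[c(1L, .N)], by = id]$V1
via_i <- dt[idx]

idx
via_i
```

The .I form only builds a vector of row numbers per group and subsets the table once at the end, which is why it is often preferred on large tables.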

Be sure to check out the data.table wiki to get the basics covered.

Just for completeness: you can pass slice a vector of indices:

df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))

which gives

  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3
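One behaviour worth knowing, raised in the comments on this page: when a group has only one row, slice(c(1, n())) returns that row twice, whereas filter keeps it once. A minimal sketch using the built-in mtcars data; wrapping the indices in unique() is my own suggestion for avoiding the duplicate, not something the answer proposes:

```r
library(dplyr)

one_row <- mtcars[1, ]   # a "group" consisting of a single row

dup  <- one_row %>% slice(c(1, n()))                     # indices 1 and 1
once <- one_row %>% slice(unique(c(1, n())))             # index 1 only
flt  <- one_row %>% filter(row_number() %in% c(1, n()))  # filter never repeats rows

c(nrow(dup), nrow(once), nrow(flt))
```

Which behaviour is right depends on whether you want exactly two rows per group or each qualifying row once.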

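I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:

Base package: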

df <- df[with(df, order(id, stopSequence, stopId)), ]
merge(df[!duplicated(df$id), ], 
      df[!duplicated(df$id, fromLast = TRUE), ], 
      all = TRUE)

Output:

  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      a            3
6  3      b            1
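The two duplicated() masks can also be combined with a logical OR instead of merge, which keeps the sorted row order and needs no join. This is a sketch of that variation, not the answer's own code; the names sorted, is_first and is_last are illustrative:

```r
df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 stopId = c("a","b","c","a","b","c","a","b","c"),
                 stopSequence = c(1,2,3,3,1,4,3,1,2),
                 stringsAsFactors = FALSE)

sorted <- df[order(df$id, df$stopSequence), ]

is_first <- !duplicated(sorted$id)                   # first row of each id
is_last  <- !duplicated(sorted$id, fromLast = TRUE)  # last row of each id

res <- sorted[is_first | is_last, ]
res
```

Unlike the merge version, the rows stay ordered by stopSequence within each id, and the original row names are preserved.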

Using data.table:

df <-  setDT(df)
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]

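Another approach uses lapply and a dplyr statement. We can apply an arbitrary number of summary functions to the same statement:

# summarize_all() accepts the function directly; funs() is deprecated
lapply(c(first, last), 
       function(x) df %>% group_by(id) %>% summarize_all(x)) %>% 
bind_rows()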

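For example, you could also be interested in the max stopSequence value, and do:

# each function is applied to every column separately, so max returns the
# per-id maximum of each column, not the row holding the largest stopSequence
# (stopId must be character rather than factor for max to work)
lapply(c(first, last, max), 
       function(x) df %>% group_by(id) %>% summarize_all(x)) %>%
bind_rows()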
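If whole rows are wanted rather than column-wise summaries, the same lapply pattern can loop over functions that return a row position. The sketch below is my own variation under that assumption: it uses which.min and which.max on stopSequence inside slice, and the name pickers is illustrative:

```r
library(dplyr)

df <- data.frame(id = c(1,1,1,2,2,2,3,3,3),
                 stopId = c("a","b","c","a","b","c","a","b","c"),
                 stopSequence = c(1,2,3,3,1,4,3,1,2),
                 stringsAsFactors = FALSE)

# each function maps a group's stopSequence to one row position
pickers <- list(which.min, which.max)

res <- lapply(pickers, function(f) {
  df %>%
    group_by(id) %>%
    slice(f(stopSequence)) %>%   # keep the whole row at that position
    ungroup()
}) %>%
  bind_rows() %>%
  arrange(id, stopSequence)

res
```

Because slice keeps entire rows, stopId and stopSequence in each output row always belong together.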

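Another base R alternative is to order by id and stopSequence once, split the resulting indices by id, pick only the first and last index for each id, and subset the data frame with those indices:

# note: splitting the order vector by id assumes the id column is already sorted
df[sapply(with(df, split(order(id, stopSequence), id)), function(x) 
                   c(x[1], x[length(x)])), ]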


#  id stopId stopSequence
#1  1      a            1
#3  1      c            3
#5  2      b            1
#6  2      c            4
#8  3      b            1
#7  3      a            3

Or similarly, using by:

# note: splitting the order vector by id assumes the id column is already sorted
df[unlist(with(df, by(order(id, stopSequence), id, function(x) 
                   c(x[1], x[length(x)])))), ]

Using which.min and which.max:

library(dplyr, warn.conflicts = FALSE)
df %>% 
  group_by(id) %>% 
  slice(c(which.min(stopSequence), which.max(stopSequence)))

#> # A tibble: 6 x 3
#> # Groups:   id [3]
#>      id stopId stopSequence
#>   <dbl> <fct>         <dbl>
#> 1     1 a                 1
#> 2     1 c                 3
#> 3     2 b                 1
#> 4     2 c                 4
#> 5     3 b                 1
#> 6     3 a                 3

Benchmark

It is also much faster than the currently accepted answer, because we find the minimum and maximum values per group instead of sorting the whole stopSequence column.

# create a 100k times longer data frame
df2 <- bind_rows(replicate(1e5, df, FALSE))
bench::mark(
  mm = df2 %>% 
    group_by(id) %>% 
    slice(c(which.min(stopSequence), which.max(stopSequence))),
  jeremy = df2 %>%
    group_by(id) %>%
    arrange(stopSequence) %>%
    filter(row_number()==1 | row_number()==n()))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 mm           22.6ms     27ms     34.9     14.2MB     21.3
#> 2 jeremy      254.3ms    273ms      3.66    58.4MB     11.0

Comments:

I hadn't considered writing a function. It is certainly a good way to do something more complex.

This seems overcomplicated compared to just using slice, e.g. df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1, n()))

I don't disagree, and I pointed to jeremycg's answer as the better one in the post, but having a do example here might help others when slice won't work, i.e. for more complex operations on a group. Also, you should post your comment as an answer; it is the best one.

Or df[df[order(stopSequence), .I[c(1, .N)], keyby = id]$V1]. Seeing id appear twice looks odd to me.

You can set the key in the setDT call, so the order call is not needed here.

@Artemkletsov You may not always want to set the key, though.

Or df[order(stopSequence), .SD[c(1L, .N)], by = id].

@JWilliman That won't necessarily give exactly the same result, since it doesn't re-order on id. I think df[order(stopSequence), .SD[c(1L, .N)], keyby = id] should do the trick, with the minor difference from the solution above that the result will be keyed.

row_number() %in% c(1, n()) would obviate the need to run the vector scan twice.

@MichaelChirico I suspect you left something out? i.e. filter(row_number() %in% c(1, n()))

This might be even faster than filter; I have not tested it.

@Tjebo Unlike filter, slice can return the same row multiple times, e.g. mtcars[1, ] %>% slice(c(1, n())), so in that sense the choice between them depends on what you want returned. I'd expect the timings to be close unless n is very large, where slice might be favored, but I haven't tested that either.