R: Select first and last row from grouped data
Question:
Using dplyr, how do I select the top and bottom observations/rows of grouped data in one statement?
Data & Example
Given a data frame
df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                 stopId=c("a","b","c","a","b","c","a","b","c"),
                 stopSequence=c(1,2,3,3,1,4,3,1,2))
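I can get the top and bottom observations of each group with two separate statements. Can I combine them into one statement that selects both the top and bottom observations?
There is probably a faster way: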
df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  filter(row_number()==1 | row_number()==n())
Something like:
library(dplyr)
df <- data.frame(id=c(1,1,1,2,2,2,3,3,3),
                 stopId=c("a","b","c","a","b","c","a","b","c"),
                 stopSequence=c(1,2,3,3,1,4,3,1,2))
first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}
df %>%
  group_by(id) %>%
  arrange(stopSequence) %>%
  do(first_last(.)) %>%
  ungroup
## Source: local data frame [6 x 3]
##
##   id stopId stopSequence
## 1  1      a            1
## 2  1      c            3
## 3  2      b            1
## 4  2      c            4
## 5  3      b            1
## 6  3      a            3
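The do() pattern above can also be written with group_modify(), since do() is superseded in current dplyr releases. This is only a sketch: it assumes dplyr 0.8.1 or later (where group_modify() is available) and reuses the first_last helper from the answer. It sorts before grouping, which is a choice made here rather than the answer's own ordering.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2),
                 stringsAsFactors = FALSE)

# Same helper as in the answer: first and last row of one group's data
first_last <- function(x) {
  bind_rows(slice(x, 1), slice(x, n()))
}

res <- df %>%
  arrange(stopSequence) %>%              # sort once, before grouping
  group_by(id) %>%
  group_modify(~ first_last(.x)) %>%     # .x holds the rows of one group
  ungroup()

res
```

As with do(), the helper can be swapped for any function that takes one group's rows and returns a data frame.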
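With do you can perform pretty much any number of operations on the group, but @jeremycg's answer is far more appropriate for just this task.
Not dplyr, but it is much more direct using data.table: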
library(data.table)
df <- setDT(df)
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]
A more detailed explanation:
# 1) get row numbers of first/last observations from each group
#    * basically, we sort the table by id/stopSequence, then,
#      grouping by id, name the row numbers of the first/last
#      observations for each id; this operation produces
#      a data.table
#    * .I is data.table shorthand for the row number
#    * here, to be maximally explicit, I've named the variable V1
#      as row_num to give other readers of my code a clearer
#      understanding of what operation is producing what variable
first_last = df[order(id, stopSequence), .(row_num = .I[c(1L,.N)]), by=id]
idx = first_last$row_num
# 2) extract rows by number
df[idx]
Be sure to check out the data.table wiki to get the basics covered.
Just for completeness: you can pass slice a vector of indices:
df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))
which gives
  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      b            1
6  3      a            3
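One difference between the slice and filter variants shows up when a group has a single row: slice(c(1, n())) asks for row 1 twice, while a filter on row_number() keeps each row at most once. The sketch below uses a small made-up data frame (one_row is hypothetical, not part of the question) to illustrate this.

```r
library(dplyr)

# Hypothetical data: id 2 has only one row
one_row <- data.frame(id = c(1, 1, 2), stopSequence = c(2, 1, 5))

by_slice <- one_row %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  slice(c(1, n())) %>%                      # positions 1 and n() coincide for id 2
  ungroup()

by_filter <- one_row %>%
  arrange(stopSequence) %>%
  group_by(id) %>%
  filter(row_number() %in% c(1, n())) %>%   # each row is kept at most once
  ungroup()

nrow(by_slice)   # the single row of id 2 appears twice
nrow(by_filter)  # the single row of id 2 appears once
```

Which behaviour is preferable depends on whether downstream code expects exactly two rows per group or distinct rows only.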
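I know the question specified dplyr. But since others have already posted solutions using other packages, I decided to have a go with other packages too:
Base package: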
df <- df[with(df, order(id, stopSequence, stopId)), ]
merge(df[!duplicated(df$id), ],
      df[!duplicated(df$id, fromLast = TRUE), ],
      all = TRUE)
Output:
  id stopId stopSequence
1  1      a            1
2  1      c            3
3  2      b            1
4  2      c            4
5  3      a            3
6  3      b            1
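The two duplicated() calls in the base answer mark the first and the last row of each id once the data is sorted. As a sketch of that idea (a variation, not the answer's own merge-based method), the two logical masks can be combined into a single row index, which keeps the rows in sorted order instead of the order merge produces.

```r
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
                 stopId = c("a", "b", "c", "a", "b", "c", "a", "b", "c"),
                 stopSequence = c(1, 2, 3, 3, 1, 4, 3, 1, 2),
                 stringsAsFactors = FALSE)

# Sort so that the first/last row of each id is its min/max stopSequence
df <- df[with(df, order(id, stopSequence)), ]

is_first <- !duplicated(df$id)                   # TRUE on the first row of each id
is_last  <- !duplicated(df$id, fromLast = TRUE)  # TRUE on the last row of each id

first_last <- df[is_first | is_last, ]
first_last
```

A group with a single row is flagged by both masks and is returned once.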
Using data.table:
library(data.table)
df <- setDT(df)
df[order(id, stopSequence)][, .SD[c(1,.N)], by=id]
Another approach uses lapply and a dplyr statement. We can apply an arbitrary number of summary functions to the same statement:
lapply(c(first, last),
       function(x) df %>% group_by(id) %>% summarize_all(x)) %>%
  bind_rows()
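You could, for example, be interested in the maximum stopSequence value as well. Note that each summary function is applied column by column, so the max result is a per-column maximum rather than one intact row:
# max() is applied to every non-grouping column; this assumes stopId is
# character (max() is not defined for unordered factors)
lapply(c(first, last, max),
       function(x) df %>% group_by(id) %>% summarize_all(x)) %>%
  bind_rows()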
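A different base R alternative is to order once by id and stopSequence, split the resulting indices by id, select only the first and last index for every id, and subset the data frame using those indices. (Splitting the order() result by id relies on the rows of df already being sorted by id, as they are in the example.)
df[sapply(with(df, split(order(id, stopSequence), id)), function(x)
          c(x[1], x[length(x)])), ]
#  id stopId stopSequence
#1  1      a            1
#3  1      c            3
#5  2      b            1
#6  2      c            4
#8  3      b            1
#7  3      a            3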
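When the rows are not already sorted by id, the ordering indices need to be split by the id values in sorted order. The sketch below shows one way to do that; df_shuffled is a hypothetical reshuffled copy of the example data, and ord, groups and idx are names introduced only for this illustration.

```r
# Same observations as the example data, with the rows deliberately shuffled
df_shuffled <- data.frame(
  id = c(3, 1, 2, 1, 3, 2, 2, 1, 3),
  stopId = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  stopSequence = c(3, 1, 3, 2, 1, 1, 4, 3, 2),
  stringsAsFactors = FALSE
)

# Row indices in (id, stopSequence) order
ord <- with(df_shuffled, order(id, stopSequence))

# Split the indices by the id each index actually points to
groups <- split(ord, df_shuffled$id[ord])

# First and last index of every group
idx <- unlist(lapply(groups, function(x) c(x[1], x[length(x)])),
              use.names = FALSE)

first_last <- df_shuffled[idx, ]
first_last
```

On data that is already sorted by id, this selects the same rows as the one-liner above.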
or similarly using by:
df[unlist(with(df, by(order(id, stopSequence), id, function(x)
                      c(x[1], x[length(x)])))), ]
Using which.min and which.max:
library(dplyr, warn.conflicts = F)
df %>%
  group_by(id) %>%
  slice(c(which.min(stopSequence), which.max(stopSequence)))
#> # A tibble: 6 x 3
#> # Groups:   id [3]
#>      id stopId stopSequence
#>   <dbl> <fct>         <dbl>
#> 1     1 a                 1
#> 2     1 c                 3
#> 3     2 b                 1
#> 4     2 c                 4
#> 5     3 b                 1
#> 6     3 a                 3
Benchmark
It is also much faster than the currently accepted answer, because we find the min and max value by group instead of sorting the whole stopSequence column.
# create a 100k times longer data frame
df2 <- bind_rows(replicate(1e5, df, simplify = FALSE))
bench::mark(
  mm = df2 %>%
    group_by(id) %>%
    slice(c(which.min(stopSequence), which.max(stopSequence))),
  jeremy = df2 %>%
    group_by(id) %>%
    arrange(stopSequence) %>%
    filter(row_number()==1 | row_number()==n()))
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 mm           22.6ms     27ms     34.9     14.2MB     21.3
#> 2 jeremy      254.3ms    273ms      3.66    58.4MB     11.0
Comments:
Hadn't thought of writing a function; certainly a good way to do something more complex.
This seems overcomplicated compared to just using slice, like df %>% arrange(stopSequence) %>% group_by(id) %>% slice(c(1,n()))
I don't disagree (and I pointed to jeremycg's as the better answer in the post), but having a do example here might help others when slice won't work (i.e. more complex operations on a group). And you should post your comment as an answer (it's the best one).
Or df[ df[order(stopSequence), .I[c(1,.N)], keyby=id]$V1 ]. Seeing id appear twice is weird to me.
You can set keys in the setDT call. So an order call is not needed here.
@Artemkletsov - you might not always want to set keys, though.
Or df[order(stopSequence), .SD[c(1L,.N)], by = id].
@JWilliman this won't necessarily give exactly the same result, since it won't re-order on id. I think df[order(stopSequence), .SD[c(1L, .N)], keyby = id] should do the trick (with the slight difference from the solution above that the result will be keyed).
row_number() %in% c(1, n()) would remove the need to run the vector scan twice.
@MichaelChirico I suspect you left something out? i.e. filter(row_number() %in% c(1, n()))
This may be even faster than filter; I have not tested it.
@Tjebo unlike filter, slice can return the same row multiple times, e.g. mtcars[1, ] %>% slice(c(1, n())), so in that sense the choice between them depends on what you want returned. I'd expect the timings to be close unless n is very large (where slice might be favoured), but haven't tested either.