R: efficiently filter multiple columns by group
Suppose a dataset with several rows per ID and multiple columns, some of which contain codes stored as strings:
df <- data.frame(id   = rep(1:3, each = 2),
                 var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"),
                 var2 = c("Y1", "X2", "Y2", "Y3", "Z1", "Z2"),
                 var3 = c("Y1", "Y2", "X1", "Y3", "Z1", "Z2"),
                 stringsAsFactors = FALSE)
id var1 var2 var3
1 1 X1 Y1 Y1
2 1 Y1 X2 Y2
3 2 Y2 Y2 X1
4 2 Y3 Y3 Y3
5 3 Z1 Z1 Z1
6 3 Z2 Z2 Z2
It works well, is compact and easy to understand, but for large datasets, with millions of IDs and tens of millions of observations, it is quite inefficient. I would welcome any ideas for computationally more efficient code using any library. Here is a data.table variant:
library(data.table)
cols <- grep('var', names(df))
setDT(df)
df[, .SD[all(!Reduce(`|`, lapply(.SD, grepl, pattern = '^X')))], id, .SDcols = cols]
# id var1 var2 var3
#1: 3 Z1 Z1 Z1
#2: 3 Z2 Z2 Z2
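As a side note on the grouped call above, the same filter can also be computed without per-group `.SD[...]` subsetting, by building the row mask once over the whole table and then anti-joining on the hit ids. This is a sketch of my own variant (not from the thread), using the question's df:

```r
library(data.table)

df <- data.table(id   = rep(1:3, each = 2),
                 var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"),
                 var2 = c("Y1", "X2", "Y2", "Y3", "Z1", "Z2"),
                 var3 = c("Y1", "Y2", "X1", "Y3", "Z1", "Z2"))

cols <- grep("var", names(df), value = TRUE)
# one vectorised pass over the whole table:
# TRUE for rows where any var column starts with "X"
hit_row <- df[, Reduce(`|`, lapply(.SD, startsWith, "X")), .SDcols = cols]
df[!id %in% unique(id[hit_row])]  # only the id = 3 rows remain
```

Avoiding the per-group `.SD[...]` call matters once there are many groups, which is exactly one of the speed tips given further down in this thread.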
Here is another tidyverse approach:
library(dplyr)

my_fun <- function(.data) {
  .data %>%
    group_by(id) %>%
    filter(!grepl("X", paste(var1, var2, var3, collapse = ""))) %>%
    ungroup()
}
my_fun(df)
# # A tibble: 2 x 4
# id var1 var2 var3
# <int> <chr> <chr> <chr>
# 1 3 Z1 Z1 Z1
# 2 3 Z2 Z2 Z2
library(purrr)  # for reduce(); across() comes from dplyr

df_fun <- function(.data) {
  .data %>%
    group_by(id) %>%
    filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`))) %>%
    ungroup()
}
performance <- bench::mark(
my_fun(df),
df_fun(df)
)
performance %>% select(1:4)
# # A tibble: 2 x 4
# expression min median `itr/sec`
# <bch:expr> <bch:tm> <bch:tm> <dbl>
# 1 my_fun(df) 2.6ms 2.7ms 364.
# 2 df_fun(df) 6.01ms 6.39ms 152.
You can simply use cur_data() by making it behave like a vector/matrix, i.e. by wrapping it in as.vector or, more appropriately, as.matrix. (Note: in dplyr >= 1.1, cur_data() is superseded by pick().)
library(dplyr)
library(stringr)

df %>%
  group_by(id) %>%
  filter(!any(str_detect(as.matrix(cur_data()), 'X')))
# # A tibble: 2 x 4
# # Groups:   id [1]
#   id    var1  var2  var3
#   <int> <chr> <chr> <chr>
# 1     3 Z1    Z1    Z1
# 2     3 Z2    Z2    Z2
Or, if you want to apply it only to selected columns:
df %>%
group_by(id) %>%
filter(!any(grepl('X', as.matrix(select(cur_data(), starts_with('var'))))))
Another option is to use the new if_all() or if_any(). To solve the problem above, we need to generalize it a bit further:

library(dplyr)

df %>%
  group_by(id) %>%
  filter(all(if_all(starts_with("var"), ~ !grepl("^X", .x))))
# # A tibble: 2 x 4
# # Groups:   id [1]
#   id    var1  var2  var3
#   <int> <chr> <chr> <chr>
# 1     3 Z1    Z1    Z1
# 2     3 Z2    Z2    Z2

Created on 2021-06-14 by the reprex package (v0.3.0)

Two more data.table solutions:
library(data.table)
setDT(df)
df[,.SD[!any(grepl("X", .SD))],by=id,.SDcols=patterns('var')]
id var1 var2 var3
1: 3 Z1 Z1 Z1
2: 3 Z2 Z2 Z2
This may come at the cost of slightly reduced readability:
df[df[, .(keep=.I[!any(grepl("X", .SD))]), by=id,.SDcols=patterns('var')]$keep]
Benchmark:
If 'id' is always the first column and the values are in the remaining columns:
df[df$id %in% names(which(!tapply(grepl("X", as.matrix(df[-1])),
rep(df[ , 1], ncol(df) - 1), any))), ]
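Unpacked into intermediate steps (same df as in the question), that one-liner does the following:

```r
df <- data.frame(id   = rep(1:3, each = 2),
                 var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"),
                 var2 = c("Y1", "X2", "Y2", "Y3", "Z1", "Z2"),
                 var3 = c("Y1", "Y2", "X1", "Y3", "Z1", "Z2"))

m <- grepl("X", as.matrix(df[-1]))  # one logical per cell, in column-major order
g <- rep(df[, 1], ncol(df) - 1)     # the id of each cell, in the same order
hit_by_id <- tapply(m, g, any)      # per id: TRUE if any of its cells matched
df[df$id %in% names(which(!hit_by_id)), ]
#   id var1 var2 var3
# 5  3   Z1   Z1   Z1
# 6  3   Z2   Z2   Z2
```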
Some points where speed can be gained:
Try to avoid constructs like group_by in dplyr or by = in data.table where possible, as the grouping machinery costs overall performance.
If you have a fixed target pattern, e.g. codes starting with X, then substr (or startsWith) can be more efficient than grepl with the pattern ^X.
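The prefix-test point can be illustrated with a quick base R sketch; the three tests below are equivalent for codes like these, and startsWith is the dedicated one. No timings are claimed here, since they vary by machine:

```r
set.seed(1)
x <- paste0(sample(c("X", "Y", "Z"), 1e5, replace = TRUE),
            sample(1:3, 1e5, replace = TRUE))

via_grepl      <- grepl("^X", x)           # regex engine, anchored pattern
via_substr     <- substr(x, 1, 1) == "X"   # fixed-position character compare
via_startsWith <- startsWith(x, "X")       # dedicated prefix test
stopifnot(identical(via_grepl, via_substr),
          identical(via_grepl, via_startsWith))
```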
Some base R approaches
It seems we can speed this up further in base R with the following:
TIC1 <- function() {
  subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}
Another base R solution, using the code example provided by ThomasIsCoding. First, define a helper function:
repeated_or <- function(...) {
  L <- list(...)
  ans <- L[[1L]]
  if (...length() > 1L)
    for (i in seq.int(2L, ...length()))
      ans <- ans | L[[i]]
  ans
}
Now, benchmark using the data-generating code provided by ThomasIsCoding:
n <- 1e4
df <- data.frame(
id = rep(1:(n / 2), each = 2, length.out = n),
var1 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
var2 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
var3 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
stringsAsFactors = FALSE
)
library("microbenchmark")
microbenchmark(
fun1(),
fun2(),
TIC1(),
TIC2(),
waldi_speed(),
unit = "relative"
)
## Unit: relative
## expr min lq mean median uq max neval
## fun1() 1.180372 1.183109 1.205269 1.189091 1.187704 1.163667 100
## fun2() 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000 100
## TIC1() 3.487775 3.462417 3.549228 3.491580 3.494310 2.857216 100
## TIC2() 1.140145 1.131872 1.141466 1.146900 1.142863 1.078746 100
## waldi_speed() 31.440025 30.845971 30.556054 30.798701 30.338251 26.213920 100
This approach combines all columns with paste and then relies on stringr to produce a vector of all IDs in which an "X" is present. For comparison, the grouped reduce/across variant:

system.time(
  df %>%
    group_by(id) %>%
    filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`)))
)
#  user  system elapsed
# 0.022   0.001   0.023

The stringr-based variant runs in about a millisecond and returns the same filtered result:

#  user  system elapsed
# 0.002   0.000   0.001

df_filtered
#   id var1 var2 var3
# 1  3   Z1   Z1   Z1
# 2  3   Z2   Z2   Z2

Created on 2021-06-02 by the reprex package (v2.0.0)

Dedicated functions: if you are searching for a specific code, there may be many operations you can skip, so a function dedicated to this kind of code can be faster than a general-purpose one; startsWith(x, "X") will be faster than grepl("^X", x).

Subsetting: if the function searching for the specific code is slower than subsetting, run it only on the remaining columns of the rows where the code has not been found yet.

Hash lookup: if any row with a given id has a hit, all remaining ids without a direct hit need to be compared against the hit ids, so this lookup should be fast; with a hash table, e.g. fastmatch::fmatch, it can be.

Storage type: if all columns of the data.frame have the same type, operations on them may be faster when they are stored in a matrix rather than in a list.

You can unlist df[-1] and test whether the values start with "X", build a matrix with nrow(df) rows and take rowSums; where the sum is greater than 0, the id has a hit. I store these ids in i. Optionally, the unique ids can be computed. Then test whether id is %in% i and negate with !. A possibly faster alternative to %in% is %fin% from fastmatch.

In addition, you can run the startsWith test only where there is no hit yet, and apply %in% only to the rows without an X hit; this pays off when subsetting is faster than the startsWith test and faster than the lookup.
# Run startsWith() column by column, but only on rows without a hit so far
i <- Reduce(function(x, y) `[<-`(x, !x, startsWith(y[!x], "X")),
            df[, -1], logical(nrow(df)))
# Extend the row hits to every row whose id has a hit somewhere
i[!i] <- df$id[!i] %in% df$id[i]
df[!i, ]
Another base R solution that has not been mentioned yet. Taking the match positions modulo the number of rows quickly returns the rows to remove:
df[!(df$id %in% df$id[(which(df=="X1" | df=="X2") %% nrow(df))]),]
id var1 var2 var3
5 3 Z1 Z1 Z1
6 3 Z2 Z2 Z2
It is fast, on the order of microseconds:
library(microbenchmark)
microbenchmark(df[!(df$id %in% df$id[(which(df=="X1" | df=="X2") %% nrow(df))]),])
Unit: microseconds
min lq mean median uq max
136.601 140.8505 165.7009 145.4515 172.9005 328.801
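One caveat to flag here (my addition, not part of the answer above): which() returns linear indices into the matrix that df == "X1" produces, so a match sitting in the last row is an exact multiple of nrow(df), and %% nrow(df) maps it to 0, silently dropping that hit from the lookup. A safer variant shifts the indices by one before taking the modulo:

```r
df <- data.frame(id   = rep(1:3, each = 2),
                 var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"),
                 var2 = c("Y1", "X2", "Y2", "Y3", "Z1", "Z2"),
                 var3 = c("Y1", "Y2", "X1", "Y3", "Z1", "Z2"))

# (linear index - 1) %% nrow + 1 maps every cell to its row, last row included
rows <- (which(df == "X1" | df == "X2") - 1) %% nrow(df) + 1
df[!(df$id %in% df$id[rows]), ]
#   id var1 var2 var3
# 5  3   Z1   Z1   Z1
# 6  3   Z2   Z2   Z2
```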
Out of curiosity, it would be great if you could include my if_all approach in the benchmark. Really surprised to see @Zaw's base R approach beating {data.table}! @TimTeaFan, benchmark updated, thanks! Super interesting benchmark; hoping the new if_all turns out much faster! Well done, upvoted! Actually the performance can still be improved. I'm starting to lose faith in data.table ;-) Upvoted! @Waldi, data.table itself is very fast, but the by = grouping is the bottleneck here.
subset(df, !id %in% id[Reduce(`|`, lapply(df[, -1], function(x) substr(x, 1, 1) == "X"))]) @Waldi yes, that works as long as there are not too many columns. I don't think this gives the desired output; you can try it with the OP's df. @ThomasIsCoding: yes, you are right, thanks a lot; I was only removing the rows that contain an X, and I have updated the answer. Well done, upvoted! Excellent, upvoted! By the way, Reduce can be slow if you have many columns, but FGKi3 is really impressive!
n <- 5e4
df <- data.frame(
id = rep(1:(n / 2), each = 2, length.out = n),
var1 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
var2 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
var3 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
stringsAsFactors = FALSE
)
TIC1 <- function() {
  subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}
TIC2 <- function() {
  subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}
TIC3 <- function() {
  subset(df, !id %in% id[do.call(pmax, lapply(df[-1], function(v) substr(v, 1, 1) == "X")) > 0])
}
waldi_speed <- function() {
  setDT(df)
  df[df[, .(keep = .I[!any(grepl("X", .SD))]), by = id, .SDcols = patterns("var")]$keep]
}
repeated_or <- function(...) {
  L <- list(...)
  ans <- L[[1L]]
  if (...length() > 1L) {
    for (i in seq.int(2L, ...length())) {
      ans <- ans | L[[i]]
    }
  }
  ans
}
fun1 <- function() {
  ## using a pattern
  m <- lapply(df[, -1], grepl, pattern = "^X", perl = TRUE)
  df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
fun2 <- function() {
  ## using a fixed string
  m <- lapply(df[, -1], function(x) substr(x, 1, 1) == "X")
  df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
i <- df$id[unlist(df[-1], FALSE, FALSE) |>
startsWith("X") |>
matrix(nrow(df)) |>
rowSums() > 0]
#i <- unique(i) #Optional
#i <- kit::funique(i) #Optional faster unique
df[!df$id %in% i,]
# id var1 var2 var3
#5 3 Z1 Z1 Z1
#6 3 Z2 Z2 Z2
library(fastmatch)
df[!df$id %fin% i,]
i <- lapply(df[,-1], startsWith, "X")
i <- df$id[Reduce(`|`, i)]
#i <- eval(str2lang(paste0("i[[", seq_along(i), "]]", collapse = "|"))) #Alternative to Reduce
df[!df$id %in% i,]
getDf <- function(nr, nc) { # function to create the example dataset
  data.frame(id = sample(seq_len(nr / 5), nr, TRUE),
             lapply(setNames(seq_len(nc), paste0("var", seq_len(nc))),
                    function(i) paste0(sample(LETTERS, nr, TRUE), sample(0:9, nr, TRUE))))
}
library(fastmatch)
FGKi1 <- function() {
  df[!df$id %in% df$id[rowSums(matrix(startsWith(unlist(df[-1], FALSE, FALSE), "X"),
                                      nrow(df))) > 0], ]
}
FGKi2 <- function() {
  df[!df$id %in% unique(df$id[rowSums(matrix(startsWith(unlist(df[-1], FALSE, FALSE), "X"),
                                             nrow(df))) > 0]), ]
}
FGKi3 <- function() {
  df[!df$id %fin% df$id[rowSums(matrix(startsWith(unlist(df[-1], FALSE, FALSE), "X"),
                                       nrow(df))) > 0], ]
}
FGKi4 <- function() {
  df[!df$id %in% df$id[Reduce(`|`, lapply(df[, -1], startsWith, "X"))], ]
}
FGKi5 <- function() {
  df[!df$id %fin% df$id[Reduce(`|`, lapply(df[, -1], startsWith, "X"))], ]
}
FGKi6 <- function() {
  i <- Reduce(`|`, lapply(df[, -1], startsWith, "X"))
  i[!i] <- df$id[!i] %in% df$id[i]
  df[!i, ]
}
FGKi7 <- function() {
  i <- lapply(df[, -1], startsWith, "X")
  i <- eval(str2lang(paste0("i[[", seq_along(i), "]]", collapse = "|")))
  df[!df$id %fin% df$id[i], ]
}
repeated_or <- function(...) {
  L <- list(...)
  ans <- L[[1L]]
  if (...length() > 1L)
    for (i in seq.int(2L, ...length()))
      ans <- ans | L[[i]]
  ans
}
fun2 <- function() {
  ## using a fixed string
  m <- lapply(df[, -1], function(x) substr(x, 1, 1) == "X")
  df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
TIC2 <- function() {
  subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}
set.seed(42)
df <- getDf(1e5, 3) #3 col wide Table
bench::mark(TIC2(), fun2(), FGKi1(), FGKi2(), FGKi3(), FGKi4(),
FGKi5(), FGKi6(), FGKi7())
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 TIC2() 24.7ms 24.9ms 40.2 15.07MB 112. 5 14 125ms
#2 fun2() 22.3ms 22.5ms 43.9 11.26MB 39.9 11 10 251ms
#3 FGKi1() 14.6ms 15ms 66.8 12.78MB 58.9 17 15 255ms
#4 FGKi2() 14.9ms 15.1ms 66.3 12.97MB 58.5 17 15 256ms
#5 FGKi3() 12.1ms 12.3ms 80.8 12.23MB 72.3 19 17 235ms
#6 FGKi4() 12.7ms 12.9ms 77.7 8.97MB 27.7 28 10 360ms
#7 FGKi5() 10.2ms 10.3ms 96.4 8.42MB 51.4 30 16 311ms
#8 FGKi6() 13.2ms 13.3ms 75.1 11.38MB 53.6 21 15 280ms
#9 FGKi7() 10.3ms 10.4ms 95.2 8.42MB 36.8 31 12 326ms
set.seed(42)
df <- getDf(1e4, 1e3) #1000 col wide Table
bench::mark(TIC2(), fun2(), FGKi1(), FGKi2(), FGKi3(), FGKi4(),
FGKi5(), FGKi6(), FGKi7())
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time
# <bch:expr> <bch:t> <bch:t> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm>
#1 TIC2() 430.4ms 434.4ms 2.30 230MB 3.45 2 3 869ms
#2 fun2() 374.6ms 405.6ms 2.47 191MB 6.16 2 5 811ms
#3 FGKi1() 110.8ms 117.7ms 7.87 191MB 13.8 4 7 509ms
#4 FGKi2() 108.9ms 111.1ms 8.32 191MB 11.7 5 7 601ms
#5 FGKi3() 107.8ms 107.8ms 9.25 191MB 9.25 5 5 541ms
#6 FGKi4() 52.5ms 54.6ms 16.6 115MB 14.7 9 8 543ms
#7 FGKi5() 52.5ms 54.7ms 18.3 115MB 18.3 10 10 547ms
#8 FGKi6() 52.8ms 55.2ms 18.1 115MB 16.3 10 9 553ms
#9 FGKi7() 53.7ms 56.6ms 17.6 115MB 17.6 9 9 510ms
#Warning message:
#Some expressions had a GC in every iteration; so filtering is disabled.