R 按组高效筛选多个列

R 按组高效筛选多个列,r,R,假设一个数据集包含每个ID的多行和多个列,其中包含一些存储为字符串的代码: df <- data.frame(id = rep(1:3, each = 2), var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"), var2 = c("Y1", &q

假设一个数据集包含每个ID的多行和多个列,其中包含一些存储为字符串的代码:

df <- data.frame(id = rep(1:3, each = 2),
                 var1 = c("X1", "Y1", "Y2", "Y3", "Z1", "Z2"),
                 var2 = c("Y1", "X2", "Y2", "Y3", "Z1", "Z2"),
                 var3 = c("Y1", "Y2", "X1", "Y3", "Z1", "Z2"),
                 stringsAsFactors = FALSE)

  id var1 var2 var3
1  1   X1   Y1   Y1
2  1   Y1   X2   Y2
3  2   Y2   Y2   X1
4  2   Y3   Y3   Y3
5  3   Z1   Z1   Z1
6  3   Z2   Z2   Z2
它工作良好,结构紧凑,易于理解,但是,对于大数据集、数百万个ID和数千万个观察值,它的效率相当低。我欢迎任何关于使用任何库计算效率更高的代码的想法。

这里是data.table的变体-

library(data.table)
cols <- grep('var', names(df))

setDT(df)

df[, .SD[all(!Reduce(`|`, lapply(.SD, grepl, pattern = '^X')))], id, .SDcols = cols]

#   id var1 var2 var3
#1:  3   Z1   Z1   Z1
#2:  3   Z2   Z2   Z2
这里是另一种tidyverse方法

my_fun <- function(.data) {
  .data %>% 
    group_by(id) %>% 
    filter(!grepl("X", paste(var1, var2, var3, collapse = ""))) %>% 
    ungroup()
}

my_fun(df)

# # A tibble: 2 x 4
#      id var1  var2  var3 
#   <int> <chr> <chr> <chr>
# 1     3 Z1    Z1    Z1   
# 2     3 Z2    Z2    Z2   

df_fun <- function(.data) {
  .data %>%
    group_by(id) %>%
    filter(all(reduce(.x = across(var1:var3, ~ !grepl("^X", .)), .f = `&`))) %>% 
    ungroup()
}

performance <- bench::mark(
  my_fun(df),
  df_fun(df)
)

performance %>% select(1:4)

# # A tibble: 2 x 4
#   expression       min   median `itr/sec`
#   <bch:expr>  <bch:tm> <bch:tm>     <dbl>
# 1 my_fun(df)    2.6ms    2.7ms      364.
# 2 df_fun(df)    6.01ms   6.39ms      152.
您可以简单地使用cur_数据,方法是使其行为类似于向量/矩阵,即用as.vector或更恰当地用as.matrix包装它

图书馆管理员 df%>% 组\按ID%>% 滤器anystr_detectas.matrixcur_数据'X' >一个tibble:2x4 >分组:id[1] >id var1 var2 var3 > >1 3 Z1 Z1 Z1 Z1 >23Z2Z2Z2 或者如果您只想在选定列上使用它

df %>%
  group_by(id) %>%
  filter(!any(grepl('X', as.matrix(select(cur_data(), starts_with('var'))))))

另一种选择是使用新的if_all或if_any。要解决上述问题,我们需要将其进一步概括为:

图书馆弹琴 df%>% 组\按ID%>% filterallif_all以var开头, ~ !grepl^X,.X >一个tibble:2x4 >分组:id[1] >id var1 var2 var3 > >1 3 Z1 Z1 Z1 Z1 >23Z2Z2Z2 由v0.3.0于2021年6月14日创建的其他两个数据表解决方案:

library(data.table)
setDT(df)
df[,.SD[!any(grepl("X", .SD))],by=id,.SDcols=patterns('var')]

   id var1 var2 var3
1:  3   Z1   Z1   Z1
2:  3   Z2   Z2   Z2
这可能以可读性稍差为代价:

df[df[, .(keep=.I[!any(grepl("X", .SD))]), by=id,.SDcols=patterns('var')]$keep]
基准:


如果“id”始终是第一列,其余列中的值:

df[df$id %in% names(which(!tapply(grepl("X", as.matrix(df[-1])),
                                  rep(df[ , 1], ncol(df) - 1), any))), ]
一些可能的速度点 尽量不要在dplyr中使用group_by或data.table中使用group_by=之类的方法,因为这样会降低整体性能 如果您有固定的目标模式,例如,从X开始,那么substr可能比使用模式^X的grepl更有效 一些基本的R方法 似乎我们可以通过下面的一个在基础上进一步加速

TIC1 <- function() {
    subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}
给定


另一个base R解决方案,使用ThomasIsCoding提供的代码示例。 首先,定义一个辅助函数:

repeated_or <- function(...) {
    L <- list(...)
    ans <- L[[1L]]
    if (...length() > 1L)
        for (i in seq.int(2L, ...length()))
            ans <- ans | L[[i]]
    ans
}
现在,使用ThomasIsCoding提供的代码:

n <- 1e4
df <- data.frame(
    id = rep(1:(n / 2), each = 2, length.out = n),
    var1 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var2 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var3 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    stringsAsFactors = FALSE
)

library("microbenchmark")
microbenchmark(
    fun1(),
    fun2(),
    TIC1(),
    TIC2(),
    waldi_speed(),
    unit = "relative"
)
## Unit: relative
##           expr       min        lq      mean    median        uq       max neval
##         fun1()  1.180372  1.183109  1.205269  1.189091  1.187704  1.163667   100
##         fun2()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   100
##         TIC1()  3.487775  3.462417  3.549228  3.491580  3.494310  2.857216   100
##         TIC2()  1.140145  1.131872  1.141466  1.146900  1.142863  1.078746   100
##  waldi_speed() 31.440025 30.845971 30.556054 30.798701 30.338251 26.213920   100

这种方法将所有列与paste结合起来,然后依靠stringr生成一个包含所有ID的向量,其中“X”存在

图书馆管理员 图书馆长 df% 组\按ID%>% filteralreduce.x=crossvar1:var3,~!grepl^X,,.f=`&`} >用户系统运行时间 > 0.022 0.001 0.023 答复 系统时间{ 标准% {str_sub.[str_detect.,'X'],start=-1}|> 唯一的 df_已过滤用户系统 > 0.002 0.000 0.001 df_过滤 >id var1 var2 var3 >1 3 Z1 Z1 Z1 Z1 >23Z2Z2Z2 2015年6月2日由v2.0.0创建

专用函数:如果找到了特定代码,您可能会执行许多操作。对此类代码使用专用函数可能比普通函数快。StartWithx,X将比grepl^X,X快

子集设置:如果查找特定代码的函数速度慢,则操作速度慢于子集设置,请仅对尚未找到代码的行中的其余列执行此操作

哈希查找:如果具有相同id的任何行有命中,则需要比较没有直接命中的所有剩余id。因此,列表中包含命中id的查找应该是快速的。使用类似fastmatch::fmatch的哈希表,此查找可能是快速的

存储类型:如果data.frame的列具有所有相同的类型,则将其存储在矩阵而不是列表中时,对该列的操作可能会更快

您可以取消列出df[-1]并测试它是否从X开始,创建一个带有df nrow的矩阵并获取行和,如果它大于0,则id有命中率。我将这些id存储在I中。可选,可以计算唯一id。现在测试id是否为%in%I,并使用!可能比%in%更快的替代方法是来自fastmatch的%fin%

此外,还可以测试它是否仅在没有命中的情况下以X开头,并且仅在没有命中X的行中使用%in%,这在子集设置比以X开头的测试快以及子集设置比查找匹配快的情况下是有意义的

i <- Reduce(function(x, y) `[<-`(x,!x,startsWith(y[!x], "X")),
       df[,-1], logical(nrow(df)))
i[!i] <- df$id[!i] %in% df$id[i]
df[!i,]

另一个Base R解决方案,我还没有提到。利用按行数进行模运算,快速返回要删除的行:

df[!(df$id %in% df$id[(which(df=="X1" | df=="X2") %% nrow(df))]),]
id var1 var2 var3
5  3   Z1   Z1   Z1
6  3   Z2   Z2   Z2
它很快,以微秒为单位:

library(microbenchmark)
microbenchmark(df[!(df$id %in% df$id[(which(df=="X1" | df=="X2") %% nrow(df))]),])
Unit: microseconds
min       lq     mean   median       uq     max
136.601 140.8505 165.7009 145.4515 172.9005 328.801 

出于好奇,如果你能将我的方法if_all包括在基准测试中,那就太好了。看到@Zaw的base R方法比{data.table}好,真的很惊讶!@TimTeaFan,更新的基准测试谢谢!超级有趣的基准测试。希望新的if_all会快得多!做得好,投了赞成票!实际上性能可以提高,而且我的回答非常好!我开始对data.table失去信任;-投了赞成票!@Waldi data.table本身速度非常快,但by=是瓶颈:@ThomsIsCoding,另一个退出e.效率/效率
蚂蚁一号:潜艇!id%在%id[Reduce'|',lapplydf[,-1],functionx substrx,1,1=='X']@Waldi是的,它是有效的,只要列不多:我认为这不会给出所需的输出。你可以试试OP的df@ThomasIsCoding:是的,你是对的;非常感谢。我只删除了包含X的行。我已经更新了答案。干得好!向上投票!杰出的向上投票!顺便说一下,如果您有很多列,Reduce可能会很慢,但是FGKi3确实令人印象深刻!
n <- 5e4
df <- data.frame(
    id = rep(1:(n / 2), each = 2, length.out = n),
    var1 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var2 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var3 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    stringsAsFactors = FALSE
)

TIC1 <- function() {
    subset(df, ave(rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") == 0, id, FUN = all))
}

TIC2 <- function() {
    subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}

TIC3 <- function() {
    subset(df, !id %in% id[do.call(pmax, lapply(df[-1], function(v) substr(v, 1, 1) == "X")) > 0])
}


waldi_speed <- function() {
    setDT(df)
    df[df[, .(keep = .I[!any(grepl("X", .SD))]), by = id, .SDcols = patterns("var")]$keep]
}


repeated_or <- function(...) {
    L <- list(...)
    ans <- L[[1L]]
    if (...length() > 1L) {
          for (i in seq.int(2L, ...length())) {
                ans <- ans | L[[i]]
            }
      }
    ans
}

fun1 <- function() {
    ## using a pattern
    m <- lapply(df[, -1], grepl, pattern = "^X", perl = TRUE)
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}

fun2 <- function() {
    ## using a fixed string
    m <- lapply(df[, -1], function(x) substr(x, 1, 1) == "X")
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
repeated_or <- function(...) {
    L <- list(...)
    ans <- L[[1L]]
    if (...length() > 1L)
        for (i in seq.int(2L, ...length()))
            ans <- ans | L[[i]]
    ans
}
fun1 <- function() {
    ## using a pattern
    m <- lapply(df[, -1], grepl, pattern = "^X", perl = TRUE)
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}

fun2 <- function() {
    ## using a fixed string
    m <- lapply(df[, -1], function(x) substr(x, 1,1) == "X")
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
n <- 1e4
df <- data.frame(
    id = rep(1:(n / 2), each = 2, length.out = n),
    var1 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var2 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    var3 = mapply(paste0, LETTERS[23 + sample(1:3, n, replace = T)], sample(1:3, n, replace = T)),
    stringsAsFactors = FALSE
)

library("microbenchmark")
microbenchmark(
    fun1(),
    fun2(),
    TIC1(),
    TIC2(),
    waldi_speed(),
    unit = "relative"
)
## Unit: relative
##           expr       min        lq      mean    median        uq       max neval
##         fun1()  1.180372  1.183109  1.205269  1.189091  1.187704  1.163667   100
##         fun2()  1.000000  1.000000  1.000000  1.000000  1.000000  1.000000   100
##         TIC1()  3.487775  3.462417  3.549228  3.491580  3.494310  2.857216   100
##         TIC2()  1.140145  1.131872  1.141466  1.146900  1.142863  1.078746   100
##  waldi_speed() 31.440025 30.845971 30.556054 30.798701 30.338251 26.213920   100
i <- df$id[unlist(df[-1], FALSE, FALSE) |>
             startsWith("X") |>
             matrix(nrow(df)) |>
             rowSums() > 0]
#i <- unique(i)       #Optional
#i <- kit::funique(i) #Optional faster unique
df[!df$id %in% i,]
#  id var1 var2 var3
#5  3   Z1   Z1   Z1
#6  3   Z2   Z2   Z2

library(fastmatch)
df[!df$id %fin% i,]
i <- lapply(df[,-1], startsWith, "X")
i <- df$id[Reduce(`|`, i)]
#i <- eval(str2lang(paste0("i[[", seq_along(i), "]]", collapse = "|"))) #Alternative to Reduce
df[!df$id %in% i,]
i <- Reduce(function(x, y) `[<-`(x,!x,startsWith(y[!x], "X")),
       df[,-1], logical(nrow(df)))
i[!i] <- df$id[!i] %in% df$id[i]
df[!i,]
 getDf <- function(nr, nc) { #function to creat example dataset
    data.frame(id = sample(seq_len(nr/5), nr, TRUE),
      lapply(setNames(seq_len(nc), paste0("var", seq_len(nc))),
        function(i) paste0(sample(LETTERS, nr, TRUE), sample(0:9, nr, TRUE))))
}

library(fastmatch)
FGKi1 <- function() {
  df[!df$id %in% df$id[rowSums(matrix(startsWith(unlist(df[-1], FALSE, FALSE),
                                                 "X"), nrow(df))) > 0],]}
FGKi2 <- function() {
  df[!df$id %in% unique(df$id[rowSums(matrix(startsWith(unlist(df[-1],
                                 FALSE, FALSE), "X"), nrow(df))) > 0]),]}
FGKi3 <- function() {
  df[!df$id %fin% df$id[rowSums(matrix(startsWith(unlist(df[-1], FALSE, FALSE),
                                                  "X"), nrow(df))) > 0],]}
FGKi4 <- function() {
  df[!df$id %in% df$id[Reduce(`|`, lapply(df[, -1], startsWith, "X"))],]
}
FGKi5 <- function() {
  df[!df$id %fin% df$id[Reduce(`|`, lapply(df[, -1], startsWith, "X"))],]
}
FGKi6 <- function() {
  i <- Reduce(`|`, lapply(df[, -1], startsWith, "X"))
  i[!i] <- df$id[!i] %in% df$id[i]
  df[!i,]
}
FGKi7 <- function() {
  i <- lapply(df[, -1], startsWith, "X")
  i <- eval(str2lang(paste0("i[[", seq_along(i), "]]", collapse = "|")))
  df[!df$id %fin% df$id[i],]
}
repeated_or <- function(...) {
    L <- list(...)
    ans <- L[[1L]]
    if (...length() > 1L)
        for (i in seq.int(2L, ...length()))
            ans <- ans | L[[i]]
    ans
}
fun2 <- function() {
    ## using a fixed string
    m <- lapply(df[, -1], function(x) substr(x, 1,1) == "X")
    df[!df$id %in% df$id[do.call(repeated_or, m)], ]
}
TIC2 <- function() {
    subset(df, !id %in% id[rowSums(substr(as.matrix(df[, -1]), 1, 1) == "X") > 0])
}
set.seed(42)
df <- getDf(1e5, 3) #3 col wide Table
bench::mark(TIC2(), fun2(), FGKi1(), FGKi2(), FGKi3(), FGKi4(),
   FGKi5(), FGKi6(), FGKi7())
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 TIC2()      24.7ms  24.9ms      40.2   15.07MB    112.      5    14      125ms
#2 fun2()      22.3ms  22.5ms      43.9   11.26MB     39.9    11    10      251ms
#3 FGKi1()     14.6ms    15ms      66.8   12.78MB     58.9    17    15      255ms
#4 FGKi2()     14.9ms  15.1ms      66.3   12.97MB     58.5    17    15      256ms
#5 FGKi3()     12.1ms  12.3ms      80.8   12.23MB     72.3    19    17      235ms
#6 FGKi4()     12.7ms  12.9ms      77.7    8.97MB     27.7    28    10      360ms
#7 FGKi5()     10.2ms  10.3ms      96.4    8.42MB     51.4    30    16      311ms
#8 FGKi6()     13.2ms  13.3ms      75.1   11.38MB     53.6    21    15      280ms
#9 FGKi7()     10.3ms  10.4ms      95.2    8.42MB     36.8    31    12      326ms

set.seed(42)
df <- getDf(1e4, 1e3) #1000 col wide Table
bench::mark(TIC2(), fun2(), FGKi1(), FGKi2(), FGKi3(), FGKi4(),
   FGKi5(), FGKi6(), FGKi7())
#  expression     min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
#  <bch:expr> <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
#1 TIC2()     430.4ms 434.4ms      2.30     230MB     3.45     2     3      869ms
#2 fun2()     374.6ms 405.6ms      2.47     191MB     6.16     2     5      811ms
#3 FGKi1()    110.8ms 117.7ms      7.87     191MB    13.8      4     7      509ms
#4 FGKi2()    108.9ms 111.1ms      8.32     191MB    11.7      5     7      601ms
#5 FGKi3()    107.8ms 107.8ms      9.25     191MB     9.25     5     5      541ms
#6 FGKi4()     52.5ms  54.6ms     16.6      115MB    14.7      9     8      543ms
#7 FGKi5()     52.5ms  54.7ms     18.3      115MB    18.3     10    10      547ms
#8 FGKi6()     52.8ms  55.2ms     18.1      115MB    16.3     10     9      553ms
#9 FGKi7()     53.7ms  56.6ms     17.6      115MB    17.6      9     9      510ms
#Warning message:
#Some expressions had a GC in every iteration; so filtering is disabled. 
df[!(df$id %in% df$id[(which(df=="X1" | df=="X2") %% nrow(df))]),]
id var1 var2 var3
5  3   Z1   Z1   Z1
6  3   Z2   Z2   Z2
library(microbenchmark)
microbenchmark(df[!(df$id %in% df$id[(which(df=="X1" | df=="X2") %% nrow(df))]),])
Unit: microseconds
min       lq     mean   median       uq     max
136.601 140.8505 165.7009 145.4515 172.9005 328.801