优化版本的grep，使向量与向量匹配_R_Regex_Optimization_Vector

优化版本的grep，使向量与向量匹配

r regex optimization vector

优化版本的grep，使向量与向量匹配,r,regex,optimization,vector,R,Regex,Optimization,Vector,假设我有两个字符向量a和b： set.seed(123) categ <- c("Control", "Gr", "Or", "PMT", "P450") genes <- paste(categ, rep(1:40, each=length(categ)), sep="_") a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="") b0 <- paste (a0, "1", sep="")

假设我有两个字符向量

和

：

set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="")
b0 <- paste (a0, "1", sep="")
ite <- 200
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

但我想知道是否有更有效的方法，因为我将不得不在模拟中对更大的向量运行很多次（注意，我也不想使用

mclappy

，因为我已经使用它来运行模拟的每个迭代）：

由于您不使用正则表达式，但希望在较长的字符串中查找子字符串，因此可以使用

fixed=TRUE

。速度快得多

library(microbenchmark)
microbenchmark(lapply(a, grep, b),                 # original
                 lapply(paste0("^", a), grep, b),  # @flodel
                 lapply(a, grep, b, fixed = TRUE))

Unit: microseconds
                             expr     min       lq   median       uq     max neval
               lapply(a, grep, b) 112.633 114.2340 114.9390 116.0990 326.857   100
  lapply(paste0("^", a), grep, b) 119.949 121.7380 122.7425 123.9775 191.851   100
 lapply(a, grep, b, fixed = TRUE)  21.004  22.5885  23.8580  24.6110  33.608   100

使用更长的向量（原始长度的1000倍）进行测试

ar遵循我最后的建议
你所问的最大问题是，先验地，你需要做length（a）*length（b）
比较。但是，您可以利用以下事实：这里的匹配只发生在字符串的开头（我从注释中收集到的）
我建议您首先将a
和b
向量拆分为列表，然后查看每个项目中的第一个单词（“Or”、“Gr”、“Control”、“PMT”等），然后只在相应的集合中查找匹配项。换句话说，取a
中以或
开头的项目，只在b
中也以或
开头的项目中查找匹配项
让您了解为什么这在复杂性方面是有效的。想象一下a
和b
都有长度n
；有x
可能的前缀，均匀分布在a
和b
中。然后，您只需对您的情况进行x*（n/x*n/x）
与n*n
的比较。这是比较的x
倍。你甚至可以想象用第二个词，第三个词，以递归的方式重复这个过程
下面是它的代码：
reduced.match <- function(a, b) {

   first.word <- function(string) sub("_.*", "", string)

   a.first <- first.word(a)
   b.first <- first.word(b)
   l.first <- unique(c(a.first, b.first))
   a.first <- factor(a.first, l.first)
   b.first <- factor(b.first, l.first)
   a.split <- split(a, a.first)
   b.split <- split(b, b.first)
   a.idx.split <- split(seq_along(a), a.first)
   b.idx.split <- split(seq_along(b), b.first)

   unsorted.matches <-
     Map(function(a, b, i) lapply(a, function(x) i[grep(x, b, fixed = TRUE)]),
         a.split, b.split, b.idx.split, USE.NAMES = FALSE)

   sorted.matches <-
     unlist(unsorted.matches, recursive = FALSE)[
       match(seq_along(a), unlist(a.idx.split))]

   return(sorted.matches)
}

reduced.match如果a和b被排序（并且是唯一的），并且人们对字符串开头的精确匹配感兴趣，那么下面的C代码通常会相对有效（在长度（a）+长度（b）字符串比较的顺序上？）。R包装器确保C代码和R用户获得适当的数据
f3 <- local({
    library(inline)
    .amatch <- cfunction(c(a="character", b="character"),
             includes="#include <string.h>", '
         int len_a = Rf_length(a), len_b = Rf_length(b);
         SEXP ans = PROTECT(allocVector(INTSXP, len_b));
         memset(INTEGER(ans), 0, sizeof(int) * len_b);
         int cmp, i = 0, j = 0;
         while (i < len_a) {
             const char *ap = CHAR(STRING_ELT(a, i));
             while (j < len_b) {
                 cmp = strncmp(ap, CHAR(STRING_ELT(b, j)), strlen(ap));
                 if (cmp > 0) {
                     j += 1;
                 } else break;
             }
             if (j == len_b)
                 break;
             if (cmp == 0)
                 INTEGER(ans)[j++] = i + 1;
             else if (cmp < 0) i += 1;
         }
         UNPROTECT(1);
         return(ans);')

    function(a, b) {
        locale = Sys.getlocale("LC_COLLATE")
        if (locale != "C") {
            warning('temporarily trying to set LC_COLLATE to "C"')
            Sys.setlocale("LC_COLLATE", "C")
            on.exit(Sys.setlocale("LC_COLLATE", locale))
        }
        a0 <- a
        lvls <- unique(a)
        a <- sort(lvls)
        o <- order(b)
        idx <- .amatch(a, b[o])[order(o)]
        f <- factor(a[idx[idx != 0]], levels=lvls)
        split(which(idx != 0), f)[a0]
    }
})

不幸的是，当一个元素是另一个元素的前缀时，这个简单算法会失败
> str(f0(c("a", "ab"), "abc"))
List of 2
 $ : chr "abc"
 $ : chr "abc"
> str(f3(c("a", "ab"), "abc"))
List of 2
 $ : chr "abc"
 $ : chr(0) 

与注释相反，对于该数据集（需要为再现性指定随机数种子）
算法f0和f3已被修改为在命名列表中返回索引。
我在自己的数据上测试了@flodel和@Sven Hohenstein提出的不同解决方案（请注意，@Martin Morgan的方法暂时无法测试，因为它不支持作为a
其他元素前缀的a
元素）
重要提示：尽管所有方法在我的具体案例中给出相同的结果，但请注意，它们都有自己的方法，因此可以根据数据的结构给出不同的结果
以下是一个快速总结（结果如下所示）：
在我的测试中，length（a）
和length（b）
分别设置为200或400和2000或10000
在b
最佳方法实际上取决于问题，所有方法都值得针对每个特定情况进行测试
pmatch
始终表现良好（特别是对于长度较小的向量a
和b
，分别小于100和1000-以下未显示）
sapply（a，grep，b，fixed=T）
和reduced.match
（flodel方法）函数的性能总是优于sapply（a，grep，b））
和sapply（paste0（“^”，a），grep，b）
这是可复制的代码以及测试结果
#设置数据集
图书馆（微基准）
categ总是有最多一场比赛吗？如果是这样的话，pmatch（a，b）
听起来是个不错的建议。如果合适，您也可以在删除b
项的最后两个字符后使用精确匹配：match（a，substr（b，1L，nchar（b）-2L））
。你必须在你的大数据上试试看什么更快。谢谢flodel。我的例子有点具体，但我想要一个通用的解决方案。因此，首先，pmatch
对于只有一个匹配项的情况肯定会有所帮助。然后，这里的b
的最后两个字母是非信息性的，但情况可能并非总是如此。这将把它简化为一个grep
：grep（粘贴（a，collapse=“|”）b）
@G.Grothendieck。这是在回答一个不同的问题，不是吗？当正则表达式中没有“通配符”时，我很惊讶fixed
的速度要快得多。好吧，在我的具体示例中，使用flodel（在问题的评论中）建议的pmatch
，效率要高得多。然而，出于一般目的，您的上述解决方案非常好，而且对于非常长的向量，flodel的正式回复也值得测试。如果你同意的话，我想我会尝试在我的回答中总结所有这些。好吧，对于我的具体例子，按照你在评论中的建议使用pmatch
，效率更高。然而，对于一般用途，Sven的解决方案非常好，对于很长的向量，您的正式回复也值得测试。如果你同意的话，我想我会尝试在我的回答中总结这一切。当然，请放心。我希望您能根据自己的数据和报告时间来测试我们的方法。我测试了5种不同的方法（见下文，我无法测试Martin的方法，因为它不支持a
的某些元素作为a
的其他元素的前缀）。在我的具体案例中，使用pmatch是最有效的，但根据具体情况可能会更好
reduced.match <- function(a, b) {

   first.word <- function(string) sub("_.*", "", string)

   a.first <- first.word(a)
   b.first <- first.word(b)
   l.first <- unique(c(a.first, b.first))
   a.first <- factor(a.first, l.first)
   b.first <- factor(b.first, l.first)
   a.split <- split(a, a.first)
   b.split <- split(b, b.first)
   a.idx.split <- split(seq_along(a), a.first)
   b.idx.split <- split(seq_along(b), b.first)

   unsorted.matches <-
     Map(function(a, b, i) lapply(a, function(x) i[grep(x, b, fixed = TRUE)]),
         a.split, b.split, b.idx.split, USE.NAMES = FALSE)

   sorted.matches <-
     unlist(unsorted.matches, recursive = FALSE)[
       match(seq_along(a), unlist(a.idx.split))]

   return(sorted.matches)
}

# sample data
set.seed(123)
n <- 10000
words <- paste0(LETTERS, LETTERS, LETTERS)
a <- paste(sample(words[-1], n, TRUE),
           sample(words, n, TRUE), sep = "_")
b <- paste(sample(words[-2], n, TRUE),
           sample(words, n, TRUE), sep = "_")

# testing
identical(reduced.match(a, b), lapply(a, grep, b, fixed = TRUE))
# [1] TRUE

# benchmarks
system.time(reduced.match(a, b))
#    user  system elapsed 
#   0.187   0.000   0.187 
system.time(lapply(a, grep, b, fixed = TRUE))
#    user  system elapsed 
#   2.915   0.002   2.920 

f3 <- local({
    library(inline)
    .amatch <- cfunction(c(a="character", b="character"),
             includes="#include <string.h>", '
         int len_a = Rf_length(a), len_b = Rf_length(b);
         SEXP ans = PROTECT(allocVector(INTSXP, len_b));
         memset(INTEGER(ans), 0, sizeof(int) * len_b);
         int cmp, i = 0, j = 0;
         while (i < len_a) {
             const char *ap = CHAR(STRING_ELT(a, i));
             while (j < len_b) {
                 cmp = strncmp(ap, CHAR(STRING_ELT(b, j)), strlen(ap));
                 if (cmp > 0) {
                     j += 1;
                 } else break;
             }
             if (j == len_b)
                 break;
             if (cmp == 0)
                 INTEGER(ans)[j++] = i + 1;
             else if (cmp < 0) i += 1;
         }
         UNPROTECT(1);
         return(ans);')

    function(a, b) {
        locale = Sys.getlocale("LC_COLLATE")
        if (locale != "C") {
            warning('temporarily trying to set LC_COLLATE to "C"')
            Sys.setlocale("LC_COLLATE", "C")
            on.exit(Sys.setlocale("LC_COLLATE", locale))
        }
        a0 <- a
        lvls <- unique(a)
        a <- sort(lvls)
        o <- order(b)
        idx <- .amatch(a, b[o])[order(o)]
        f <- factor(a[idx[idx != 0]], levels=lvls)
        split(which(idx != 0), f)[a0]
    }
})

f0 <- function(a, b) {
    a0 <- a
    a <- unique(a)
    names(a) <- a
    lapply(a, grep, b, fixed=TRUE)[a0]
}

> microbenchmark(f0(a, b), f3(a, b), times=5)
Unit: milliseconds
     expr       min        lq    median        uq       max neval
 f0(a, b) 431.03595 431.45211 432.59346 433.96036 434.87550     5
 f3(a, b)  15.70972  15.75976  15.93179  16.05184  16.06767     5

> str(f0(c("a", "ab"), "abc"))
List of 2
 $ : chr "abc"
 $ : chr "abc"
> str(f3(c("a", "ab"), "abc"))
List of 2
 $ : chr "abc"
 $ : chr(0) 

set.seed(123)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste0(genes, "_", rep(1:50, each=length(genes)), "_")
b0 <- paste0(a0, "1")
ite <- 50
lg <- 1000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

> identical(unname(f3(a, b)), lapply(a, grep, b, fixed=TRUE))
[1] TRUE

# set up the data set
library(microbenchmark)
categ <- c("Control", "Gr", "Or", "PMT", "P450")
genes <- paste(categ, rep(1:40, each=length(categ)), sep="_")
a0 <- paste(genes, "_", rep(1:50, each=length(genes)), "_", sep="")
b0 <- paste (a0, "1", sep="")

# length(a)==200 & length(b)==2,000
ite <- 200
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr        min         lq     median
               as.vector(sapply(a, grep, b)) 188.810585 189.256705 189.827765
  as.vector(sapply(paste0("^", a), grep, b)) 157.600510 158.113507 158.560619
 as.vector(sapply(a, grep, b, fixed = TRUE))  23.954520  24.109275  24.269991
                 unlist(reduced.match(a, b))   7.999203   8.087931   8.140260
                                pmatch(a, b)   7.459394   7.489923   7.586329
         uq        max neval
 191.412879 222.131220   100
 160.129008 186.695822   100
  25.923741  26.380578   100
   8.237207  10.063783   100
   7.637560   7.888938   100


# length(a)==400 & length(b)==2,000
ite <- 400
lg <- 2000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr       min        lq    median
               as.vector(sapply(a, grep, b)) 376.85638 379.58441 380.46107
  as.vector(sapply(paste0("^", a), grep, b)) 314.38333 316.79849 318.33426
 as.vector(sapply(a, grep, b, fixed = TRUE))  49.56848  51.54113  51.90420
                 unlist(reduced.match(a, b))  13.31185  13.44923  13.57679
                                pmatch(a, b)  15.15788  15.24773  15.36917
        uq       max neval
 383.26959 415.23281   100
 320.92588 346.66234   100
  52.02379  81.65053   100
  15.56503  16.83750   100
  15.45680  17.58592   100


# length(a)==200 & length(b)==10,000
ite <- 200
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr       min        lq    median
               as.vector(sapply(a, grep, b)) 975.34831 978.55579 981.56864
  as.vector(sapply(paste0("^", a), grep, b)) 808.79299 811.64919 814.16552
 as.vector(sapply(a, grep, b, fixed = TRUE)) 119.64240 120.41718 120.73548
                 unlist(reduced.match(a, b))  34.23893  34.56048  36.23506
                                pmatch(a, b)  37.57552  37.82128  38.01727
        uq        max neval
 986.17827 1061.89808   100
 824.41931  854.26298   100
 121.20605  151.43524   100
  36.57896   43.33285   100
  38.21910   40.87238   100



# length(a)==400 & length(b)==10500
ite <- 400
lg <- 10000
b <- b0[1:lg]
a <- (a0[1:lg])[sample(seq(lg), ite)]

microbenchmark(as.vector(sapply(a, grep, b)),                 # original
               as.vector(sapply(paste0("^", a), grep, b)),  # @flodel 1
               as.vector(sapply(a, grep, b, fixed = TRUE)), # Sven Hohenstein
               unlist(reduced.match(a, b)), # @ flodel 2
#~               f3(a, b), @Martin Morgan
               pmatch(a, b))

Unit: milliseconds
                                        expr        min         lq     median
               as.vector(sapply(a, grep, b)) 1977.69564 2003.73443 2028.72239
  as.vector(sapply(paste0("^", a), grep, b)) 1637.46903 1659.96661 1677.21706
 as.vector(sapply(a, grep, b, fixed = TRUE))  236.81745  238.62842  239.67875
                 unlist(reduced.match(a, b))   57.18344   59.09308   59.48678
                                pmatch(a, b)   75.03812   75.40420   75.60641
         uq        max neval
 2076.45628 2223.94624   100
 1708.86306 1905.16534   100
  241.12830  283.23043   100
   59.76167   88.71846   100
   75.99034   90.62689   100