Fast partial string matching in R

Tags: string, r, performance, string-matching

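Given a vector of strings, texts, and a vector of patterns, patterns, I want to find any matching pattern for each text.

For small datasets, this can easily be done in R with grepl: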

patterns = c("some","pattern","a","horse")
texts = c("this is a text with some pattern", "this is another text with a pattern")

# for each x in patterns
lapply( patterns, function(x){
  # match all texts against pattern x
  res = grepl( x, texts, fixed=TRUE )
  print(res)
  # do something with the matches
  # ...
})

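This solution is correct, but it does not scale up. Even with moderately sized datasets (about 500 texts and patterns), this code is embarrassingly slow, solving only about 100 cases per second on a modern machine. That is ridiculous considering that this is a crude partial string match without regular expressions (set with fixed=TRUE). Even making the lapply parallel does not solve the issue. Is there a way to rewrite this code efficiently?

Thanks,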

Mulone

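Have you accurately described your problem and the performance you are seeing? Here is a text corpus (pg100.txt) and a query against it: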

text = readLines("~/Downloads/pg100.txt")
pattern <- 
    strsplit("all the world's a stage and all the people players", " ")[[1]]
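To check the performance claim without the pg100.txt file, one could time the same lapply/grepl call on synthetic data. The sketch below is only an illustration: the random five-letter "words", the sizes (500 texts, 500 patterns), and the variable names are assumptions chosen to mimic the question's scenario, not data from the thread.

```r
# Synthetic stand-in for the question's scenario (assumed sizes and contents).
set.seed(1)
words <- vapply(
  seq_len(2000),
  function(i) paste(sample(letters, 5, replace = TRUE), collapse = ""),
  character(1)
)
texts <- vapply(
  seq_len(500),
  function(i) paste(sample(words, 10), collapse = " "),
  character(1)
)
patterns <- sample(words, 500)

# Time one full pass: every pattern matched against every text.
timing <- system.time(
  matches <- lapply(patterns, grepl, texts, fixed = TRUE)
)
print(timing[["elapsed"]])
```

If the elapsed time on real data is far worse than on data of comparable size generated this way, the bottleneck is probably in the code around grepl rather than in grepl itself.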

Use the stringi package. It is even faster than grepl. Check the benchmarks! I used the text from @Martin Morgan's post.

require(stringi)
require(microbenchmark)

text = readLines("~/Desktop/pg100.txt")
pattern <-  strsplit("all the world's a stage and all the people players", " ")[[1]]

grepl_fun <- function(){
    lapply(pattern, grepl, text, fixed=TRUE)
}

stri_fixed_fun <- function(){
    # no third positional argument: current stringi releases would read it as `negate`
    lapply(pattern, function(x) stri_detect_fixed(text, x))
}

#        microbenchmark(grepl_fun(), stri_fixed_fun())
#    Unit: milliseconds
#                 expr      min       lq   median       uq      max neval
#          grepl_fun() 432.9336 435.9666 446.2303 453.9374 517.1509   100
#     stri_fixed_fun() 213.2911 218.1606 227.6688 232.9325 285.9913   100

# if you don't believe me that the results are equal, you can check :)
xx <- grepl_fun()
stri <- stri_fixed_fun()

for(i in seq_along(xx)){
    print(all(xx[[i]] == stri[[i]]))
}
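The element-by-element loop above can also be written as a single identical() call. The following is a minimal sketch on the question's small vectors rather than on pg100.txt; the requireNamespace() guard is an addition so that the snippet still runs when stringi is not installed.

```r
# Small inputs taken from the question.
patterns <- c("some", "pattern", "a", "horse")
texts <- c("this is a text with some pattern",
           "this is another text with a pattern")

# Base R result: one logical vector per pattern.
base_res <- lapply(patterns, grepl, texts, fixed = TRUE)

# Compare with stringi only if the package is available.
if (requireNamespace("stringi", quietly = TRUE)) {
  stri_res <- lapply(patterns, function(p) stringi::stri_detect_fixed(texts, p))
  print(identical(base_res, stri_res))
}
```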

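Are your patterns all single words? Are you only interested in whether each element of pattern occurs in one or more elements of text, or do you need to know which elements of text they occur in?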
> idx = Reduce("+", lapply(pattern, grepl, text, fixed=TRUE))
> range(idx)
[1] 0 7
> sum(idx == 7)
[1] 8
> text[idx == 7]
[1] "    And all the men and women merely players;"                       
[2] "    cicatrices to show the people when he shall stand for his place."
[3] "    Scandal'd the suppliants for the people, call'd them"            
[4] "    all power from the people, and to pluck from them their tribunes"
[5] "    the fashion, and so berattle the common stages (so they call"    
[6] "    Which God shall guard; and put the world's whole strength"       
[7] "    Of all his people and freeze up their zeal,"                     
[8] "    the world's end after my name-call them all Pandars; let all"    
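The Reduce("+", ...) call above counts, for each line, how many patterns it contains. If you also need to know which patterns matched each text, one possible approach is to keep the full logical matrix. This is a sketch on the question's small vectors; the names hits, n_matched and matched_patterns are illustrative.

```r
patterns <- c("some", "pattern", "a", "horse")
texts <- c("this is a text with some pattern",
           "this is another text with a pattern")

# One row per text, one column per pattern.
hits <- vapply(
  patterns,
  function(p) grepl(p, texts, fixed = TRUE),
  logical(length(texts))
)

# Number of patterns found in each text (same counts as the Reduce("+") idiom).
n_matched <- rowSums(hits)

# The patterns found in each text.
matched_patterns <- lapply(seq_along(texts), function(i) patterns[hits[i, ]])

print(n_matched)
print(matched_patterns)
```

Note that vapply returns a matrix here only because texts has more than one element; with a single text it would return a plain vector, and the row indexing would need adjusting.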
