Fast way for character matching in R

I'm trying to find out whether a vector of character strings maps to another one, and I'm looking for a fast way to do this in R.

Specifically, my character alphabet is the amino acids:

aa.LETTERS <- c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T')
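The input vectors `peptides.vec` and `proteins.vec` are used throughout but are not defined in this section. A minimal sketch for experimenting, assuming small randomly generated sequences (the sizes, the seed, and the helper `random.seq` are hypothetical, not taken from the question), together with the plain `grepl` loop that the benchmarks in this thread time:

```r
aa.LETTERS <- c('G','P','A','V','L','I','M','C','F','Y','W','H','K','R','Q','N','E','D','S','T')

set.seed(1)  # arbitrary seed, only for reproducibility of the toy data

# Hypothetical helper: one random amino-acid string of the given length.
random.seq <- function(len) {
  paste(sample(aa.LETTERS, len, replace = TRUE), collapse = "")
}

# Assumed toy sizes: 20 proteins of 50-200 residues each.
proteins.vec <- vapply(sample(50:200, 20, replace = TRUE), random.seq, character(1))

# 10 peptides of 8 residues, cut out of random proteins so that
# every peptide matches at least one protein.
peptides.vec <- vapply(seq_len(10), function(i) {
  prot  <- proteins.vec[sample(length(proteins.vec), 1)]
  start <- sample(nchar(prot) - 7, 1)
  substr(prot, start, start + 7)
}, character(1))

# Baseline: one row per peptide, one column per protein.
mapping.mat <- do.call(rbind, lapply(peptides.vec, function(p) grepl(p, proteins.vec)))
dim(mapping.mat)
```

Entry `[i, j]` of `mapping.mat` is `TRUE` when peptide `i` occurs somewhere in protein `j`.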
Besides looping over the peptides with grepl, another approach is to use the Biostrings Bioconductor package:

require(Biostrings)
peptides.set <- AAStringSet(x=peptides.vec)
proteins.set <- AAStringSet(x=proteins.vec)
mapping.mat <- vcountPDict(peptides.set,proteins.set)

Any idea how to do this faster?

As I mentioned in the comments, adding fixed = TRUE will give some performance improvement, and "stringi" is likely to give a nice boost as well.
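Why `fixed = TRUE` is safe for this kind of data can be seen in a small sketch. With `fixed = TRUE`, `grepl` treats the pattern as a literal string instead of a regular expression; for patterns made only of amino-acid letters the two modes return the same answer. The toy strings below are assumptions for illustration, not data from the question:

```r
# Regex mode: "." matches any single character, so "A.C" matches "ABC".
regex.hit <- grepl("A.C", "ABC")

# Fixed mode: the pattern is the literal three characters "A.C".
fixed.hit <- grepl("A.C", "ABC", fixed = TRUE)

# A pattern of plain letters has no regex metacharacters,
# so both modes agree on it.
proteins.vec <- c("GPAVLIMCFYW", "HKRQNEDST")
agree <- identical(
  grepl("AVL", proteins.vec),
  grepl("AVL", proteins.vec, fixed = TRUE)
)
```

If the peptides could ever contain characters such as `.` or `*`, the two modes would no longer be interchangeable.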

Here are some tests:

N <- as.integer(length(proteins.vec))

funOP <- function() {
  do.call(rbind, lapply(peptides.vec, function(p) grepl(p, proteins.vec)))
}

funBASE_1 <- function() {
  # Just adds "fixed = TRUE"
  do.call(rbind, lapply(peptides.vec, function(p) grepl(p, proteins.vec, fixed = TRUE)))
}

funBASE_2 <- function() {
  # Does away with the `do.call` but probably won't improve performance
  vapply(peptides.vec, function(x) grepl(x, proteins.vec, fixed = TRUE), logical(N))
}

library(stringi)
funSTRINGI <- function() {
  # Should be considerably faster
  vapply(peptides.vec, function(x) stri_detect_fixed(proteins.vec, x), logical(N))
}

library(microbenchmark)
microbenchmark(funOP(), funBASE_1(), funBASE_2(), funSTRINGI())
# Unit: milliseconds
#          expr        min         lq      mean     median         uq       max neval
#       funOP() 344.500600 348.562879 352.94847 351.585206 356.508197 371.99683   100
#   funBASE_1() 128.724523 129.763464 132.55028 132.198112 135.277821 139.65782   100
#   funBASE_2() 128.564914 129.831660 132.33836 131.607216 134.380077 140.46987   100
#  funSTRINGI()   8.629728   8.825296   9.22318   9.038496   9.444376  11.28491   100
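One detail to keep in mind when swapping between these variants: the `do.call(rbind, ...)` versions return one row per peptide, while the `vapply` versions return one column per peptide (and carry the peptides as column names). A minimal sketch on assumed toy vectors, showing that transposing makes the two layouts agree:

```r
# Toy inputs (assumptions for illustration only).
proteins.vec <- c("GPAVLIMCFYW", "HKRQNEDST", "MCFYWHKR")
peptides.vec <- c("AVL", "HKR")
N <- length(proteins.vec)

# rbind layout: length(peptides.vec) rows, length(proteins.vec) columns.
by.rbind <- do.call(rbind, lapply(peptides.vec, function(p) {
  grepl(p, proteins.vec, fixed = TRUE)
}))

# vapply layout: length(proteins.vec) rows, length(peptides.vec) columns.
by.vapply <- vapply(peptides.vec, function(x) {
  grepl(x, proteins.vec, fixed = TRUE)
}, logical(N))

dim(by.rbind)
dim(by.vapply)

# After transposing and dropping the dimnames the results are identical.
same <- identical(by.rbind, unname(t(by.vapply)))
```

The orientation does not affect the timings much, but it matters for any code that indexes the result afterwards.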

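Off the top of my head (untested), adding fixed = TRUE usually gives a good speed boost. See also the "stringi" package. (On another note, leaving whitespace out of your code does not improve its performance.)

Actually, in some cases, depending on the nature of the data you are working with, funBASE_2 might be faster; at least that is what I got in some tests...

Would you consider using a multicore version? proteins.vec is actually pretty large, though.

@dan, honestly, I have no experience with that yet, but I think this would be a good candidate.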
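The multicore idea raised in the comments is not worked out in the thread. A minimal sketch of what it might look like, assuming the `parallel` package that ships with R and toy inputs (the vectors and the core count are assumptions, and no timings are claimed here):

```r
library(parallel)

# Toy inputs (assumptions for illustration only).
proteins.vec <- c("GPAVLIMCFYW", "HKRQNEDST", "MCFYWHKR")
peptides.vec <- c("AVL", "HKR", "QQQ")

# mclapply relies on forking, which is unavailable on Windows,
# so fall back to a single core there.
n.cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Each peptide is searched independently, so the loop splits cleanly.
hits <- mclapply(peptides.vec, function(p) {
  grepl(p, proteins.vec, fixed = TRUE)
}, mc.cores = n.cores)

# Same layout as the rbind variants: one row per peptide.
mapping.mat <- do.call(rbind, hits)
```

Whether this helps depends on the data size: for small inputs the cost of starting worker processes can outweigh the gain.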
Timings reported for the grepl loop and for vcountPDict:

> microbenchmark(do.call(rbind,lapply(peptides.vec,function(p){
   grepl(p,proteins.vec)
 })),times=100)
Unit: milliseconds
                                                                             expr      min       lq     mean   median       uq      max neval
 do.call(rbind, lapply(peptides.vec, function(p) {     grepl(p, proteins.vec) })) 477.2509 478.8714 482.8937 480.4398 484.3076 509.8098   100
> microbenchmark(vcountPDict(peptides.set,proteins.set),times=100)
Unit: milliseconds
                                    expr    min       lq     mean   median       uq      max neval
 vcountPDict(peptides.set, proteins.set) 283.32 284.3334 285.0205 284.7867 285.2467 290.6725   100