R: matching multiple fragments to multiple positions in a long character string

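My question:

Data
I believe it matches from one end and returns NA. What I want is for these fragments to match anywhere in the string, as long as the characters appear in the same order. If a fragment occurs twice, I need that information as well... Thanks...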

I know of the str_count function in the stringr package:

library(stringr)

set.seed(123)
randDNA = function(n) paste(sample(c("A", "C", "T", "G"), n, replace = TRUE), collapse = "")
bigse <- randDNA(10000)

str_count(bigse, c("ATTGG", "TTTTT", "CCCCC", "ATATG"))
[1] 2 9 5 6
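I would be surprised if pkg:Biostrings masked the base function pmatch, and I am guessing that you are seeing the expected behavior of base::pmatch. Like joran, I thought your 50-million-character test was far too long, so I pared it down to something where the output of a successful matchPattern call from "Biostrings" could fit on this page, but be sure to scroll past it for the full test. On the full test it was amazingly fast, in fact much faster than constructing the string.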

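I looked into why the stringr results differ and found that it comes down to how matches of highly repetitive fragments are counted. Given that you are working with biological data, I would think you would want to accept the Biostrings convention unless you have a specific reason not to. In that case you should read further into the details of the function and its more complete output.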
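The counting difference can be illustrated in base R. This is only a sketch under one assumption: that the gap between the two results (9 vs 11 for TTTTT, 5 vs 7 for CCCCC) comes from counting non-overlapping versus overlapping matches. The helper names count_nonoverlapping and count_overlapping and the string toy are made up for illustration and are not part of stringr or Biostrings.

```r
# Hypothetical helpers, for illustration only.

# Non-overlapping count: after a match, the search resumes past its end.
count_nonoverlapping <- function(pattern, subject) {
  hits <- gregexpr(pattern, subject, fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when nothing matches
}

# Overlapping count: a zero-width lookahead lets a match start at every
# position. Assumes the pattern has no regex metacharacters, which holds
# for plain DNA letters.
count_overlapping <- function(pattern, subject) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), subject, perl = TRUE)[[1]]
  sum(hits > 0)
}

toy <- "ACTTTTTTTGA"  # a run of seven T's

count_nonoverlapping("TTTTT", toy)  # 1
count_overlapping("TTTTT", toy)     # 3 (starts at positions 3, 4 and 5)
```

On a run of seven T's the two conventions already disagree, which is the same pattern seen for the homopolymer motifs above, while ATTGG and ATATG cannot overlap themselves at all or only rarely and so agree.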

library(Biostrings)

set.seed(123)
randDNA = function(n) paste(sample(c("A", "C", "T", "G"),
          n, replace = TRUE), collapse = "")
bigse <- randDNA(10000)
# There is a countPattern function that might narrowly give you what you wanted.
sapply(c("ATTGG", "TTTTT", "CCCCC", "ATATG"), countPattern, subject=bigse)

#ATTGG TTTTT CCCCC ATATG 
#    2    11     7     6 
But just for fun, I ran it on your big string:

 sapply(c("ATTGG", "TTTTT", "CCCCC", "ATATG"), countPattern, subject=bigse)
# ATTGG TTTTT CCCCC ATATG 
# 48850 48933 49111 49073 
Here is the speed comparison:

> system.time( sapply(c("ATTGG", "TTTTT", "CCCCC", "ATATG"), 
                   countPattern, subject=bigse) )
   user  system elapsed 
  1.507   0.119   1.618 

> system.time(str_count(bigse,c("ATTGG", "TTTTT", "CCCCC", "ATATG")))
   user  system elapsed 
  6.332   0.017   6.337 

# Added the gregexpr solution timing (not surprising to see similarity with stringr times)
> system.time( sapply(motif,function(x) length(gregexpr(x,bigse)[[1]])) )
   user  system elapsed 
  6.768   0.046   6.794 

If you want to stick with base R, you can do this:

set.seed(123) 
randDNA = function(n) paste(sample(c("A", "C", "T", "G"), n, replace = TRUE), collapse = "")            
bigse = randDNA(10000) 
motif = c("ATTGG", "TTTTT", "CCCCC", "ATATG")

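# sum(... > 0) instead of length(): gregexpr returns -1 when there is no match,
# so length() would report 1 for a motif that never occurs
sapply(motif, function(x) sum(gregexpr(x, bigse)[[1]] > 0))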
ATTGG TTTTT CCCCC ATATG 
    2     9     5     6 
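
The question also asks where each fragment occurs, not only how often. A minimal base R sketch of that, assuming non-overlapping matches are acceptable (that is how gregexpr scans); the function name find_positions and the short test string are hypothetical, chosen only for illustration:

```r
# Hypothetical helper: start positions of every fragment in one long string.
find_positions <- function(patterns, subject) {
  lapply(setNames(patterns, patterns), function(p) {
    # as.integer() drops the match.length attributes, keeping only positions
    hits <- as.integer(gregexpr(p, subject, fixed = TRUE)[[1]])
    hits[hits > 0]  # drop the -1 that signals "no match"
  })
}

res <- find_positions(c("ATTGG", "CCCCC"), "GGATTGGACATTGGT")
res$ATTGG          # 3 10  (the fragment occurs twice)
length(res$CCCCC)  # 0     (the fragment never occurs)
```

The length of each element gives the same count as the sapply call above, and the values themselves say where each repeat starts.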

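Have you tried asking this on the Bioconductor mailing list? I think there is more expertise there... (be sure to mention that you are cross-posting, and why)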