R: matching multiple fragments to multiple positions in a long character string
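I believe it matches from one end, and it returns NA. What I want is for these fragments to be matched anywhere in the string, as long as they are in the same sequence. If a fragment occurs twice, I need that information too. Thanks. I know about the str_count function in the stringr package: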
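library(stringr)  # provides str_count()
set.seed(123)
# Build a random DNA string of length n
randDNA = function(n) paste(sample(c("A", "C", "T", "G"), n, replace = TRUE), collapse = "")
bigse <- randDNA(10000)
# Count occurrences of each motif in bigse
str_count(bigse, c("ATTGG", "TTTTT", "CCCCC", "ATATG"))
[1] 2 9 5 6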
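I would be surprised if pkg:Biostrings masked the base function pmatch, and I am guessing you are seeing the expected behavior of base::pmatch. Like joran, I thought your 50-million-length test was too long, so I pared it down. I was successful using matchPattern. The output of the "Biostrings" calls could fit on this page somewhere, but be sure to scroll past it to the full test. In the full test it was amazingly fast, actually much faster than constructing the string.
I looked into the difference from stringr and found that it relates to how matches within highly repetitive segments are counted. Given that you are working with biological data, I would think you would accept the Biostrings convention unless you have specific reasons not to. In that case you should study the details of the functions and their fuller output more closely.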
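To illustrate the base::pmatch behavior mentioned above, here is a minimal sketch on a made-up toy string (not the data from the question). pmatch only does partial matching anchored at the start of each table entry, so a fragment from the middle of a sequence yields NA, whereas regexpr searches anywhere.

```r
subject <- "GGATTC"  # toy sequence, chosen only for illustration

# pmatch compares the pattern against the *beginning* of the table entry
pmatch("GGA", subject)   # the fragment is a prefix, so this matches entry 1
pmatch("ATT", subject)   # the fragment is in the middle, so this gives NA

# regexpr reports where the fragment starts, wherever it occurs
pos <- regexpr("ATT", subject, fixed = TRUE)
as.integer(pos)          # start position of the fragment
```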
library(Biostrings)  # Bioconductor package; provides countPattern() and matchPattern()
set.seed(123)
randDNA = function(n) paste(sample(c("A", "C", "T", "G"),
                                   n, replace = TRUE), collapse = "")
bigse <- randDNA(10000)
# There is a countPattern function that might narrowly give you what you wanted.
sapply(c("ATTGG", "TTTTT", "CCCCC", "ATATG"), countPattern, subject=bigse)
# ATTGG TTTTT CCCCC ATATG
#     2    11     7     6
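The higher counts for TTTTT and CCCCC are consistent with overlapping hits inside long runs being counted, whereas a plain left-to-right regex search skips past each hit before looking for the next. The following is only a base R sketch of that distinction on a toy string; the helper names count_nonoverlapping and count_overlapping are hypothetical, and the lookahead version assumes the motif contains no regex metacharacters (true for plain A/C/G/T motifs).

```r
# Hypothetical helper: hits are consumed, so they cannot overlap
count_nonoverlapping <- function(pattern, subject) {
  hits <- gregexpr(pattern, subject, fixed = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when there is no match
}

# Hypothetical helper: a zero-width lookahead consumes nothing,
# so every start position is tested, including overlapping ones
count_overlapping <- function(pattern, subject) {
  hits <- gregexpr(paste0("(?=", pattern, ")"), subject, perl = TRUE)[[1]]
  sum(hits > 0)
}

toy <- "ACTTTTTTTGA"  # toy string with a run of seven T's

count_nonoverlapping("TTTTT", toy)  # one hit: the run has room for only one disjoint TTTTT
count_overlapping("TTTTT", toy)     # three hits: starting at the 1st, 2nd and 3rd T of the run
count_nonoverlapping("GGGGG", toy)  # zero hits
```

Which convention is appropriate depends on what a "match" should mean for the analysis at hand.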
But just for fun, I also ran it on your big string:
sapply(c("ATTGG", "TTTTT", "CCCCC", "ATATG"), countPattern, subject=bigse)
# ATTGG TTTTT CCCCC ATATG
# 48850 48933 49111 49073
Here is the speed comparison:
> system.time( sapply(c("ATTGG", "TTTTT", "CCCCC", "ATATG"),
countPattern, subject=bigse) )
user system elapsed
1.507 0.119 1.618
> system.time(str_count(bigse,c("ATTGG", "TTTTT", "CCCCC", "ATATG")))
user system elapsed
6.332 0.017 6.337
# Added the gregexpr solution timing (not surprising to see similarity with stringr times)
> system.time( sapply(motif,function(x) length(gregexpr(x,bigse)[[1]])) )
user system elapsed
6.768 0.046 6.794
If you want to stick with base R, you can do this:
set.seed(123)
randDNA = function(n) paste(sample(c("A", "C", "T", "G"), n, replace = TRUE), collapse = "")
bigse = randDNA(10000)
motif = c("ATTGG", "TTTTT", "CCCCC", "ATATG")
sapply(motif,function(x) length(gregexpr(x,bigse)[[1]]))
ATTGG TTTTT CCCCC ATATG
2 9 5 6
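One caveat with the length(gregexpr(...)[[1]]) idiom: gregexpr returns -1 when there is no match, so the length is 1 rather than 0 for an absent motif. Since the question also asks where the fragments occur and whether they repeat, here is a hedged base R sketch that returns the start positions instead of just a count. The helper name motif_positions and the toy string are made up for illustration, and the lookahead pattern assumes motifs without regex metacharacters.

```r
# Hypothetical helper: start positions of every hit (overlaps included)
# for each motif, as a named list of integer vectors
motif_positions <- function(motifs, subject) {
  res <- lapply(motifs, function(m) {
    hits <- gregexpr(paste0("(?=", m, ")"), subject, perl = TRUE)[[1]]
    as.integer(hits[hits > 0])  # drop the -1 "no match" marker
  })
  names(res) <- motifs
  res
}

toy <- "ATTGGCCATTGGTTTTTT"  # toy sequence, not the data from the question
pos <- motif_positions(c("ATTGG", "TTTTT", "CCCCC"), toy)

pos$ATTGG     # occurs twice, so two start positions
pos$CCCCC     # absent, so an empty integer vector
lengths(pos)  # number of hits per motif
```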
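Have you tried asking this on the Bioconductor list? I would think there is more expertise there... (be sure to mention that you are cross-posting, and why)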