R 如何在向量序列中索引向量序列
我有一个涉及循环的问题的解决方案,而且很有效,但我觉得我缺少了一些涉及更高效实现的东西。问题是:我有一个数字向量序列,想要识别第一个向量的另一个向量中的起始位置 它的工作原理如下:R 如何在向量序列中索引向量序列,r,performance,vector,R,Performance,Vector,我有一个涉及循环的问题的解决方案,而且很有效,但我觉得我缺少了一些涉及更高效实现的东西。问题是:我有一个数字向量序列,想要识别第一个向量的另一个向量中的起始位置 它的工作原理如下: # helper function for matchSequence # wraps a vector by removing the first n elements and padding end with NAs wrapVector <- function(x, n) { stopifnot(
# helper function for matchSequence
# wraps a vector by removing the first n elements and padding end with NAs
wrapVector <- function(x, n) {
stopifnot(n <= length(x))
if (n == length(x))
return(rep(NA, n))
else
return(c(x[(n+1):length(x)], rep(NA, n)))
}
wrapVector(LETTERS[1:5], 1)
## [1] "B" "C" "D" "E" NA
wrapVector(LETTERS[1:5], 2)
## [1] "C" "D" "E" NA NA
# returns the starting index positions of the sequence found in a vector
matchSequence <- function(seq, vec) {
matches <- seq[1] == vec
if (length(seq) == 1) return(which(matches))
for (i in 2:length(seq)) {
matches <- cbind(matches, seq[i] == wrapVector(vec, i - 1))
}
which(rowSums(matches) == i)
}
myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
matchSequence(1:2, myVector)
## [1] 3 7
matchSequence(c(4, 1, 1), myVector)
## [1] 5
matchSequence(1:3, myVector)
## integer(0)
这里有一个稍微不同的想法:
f <- function(seq, vec) {
mm <- t(embed(vec, length(seq))) == rev(seq) ## relies on recycling of seq
which(apply(mm, 2, all))
}
myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
f(1:2, myVector)
# [1] 3 7
f(c(4,1,1), myVector)
# [1] 5
f(1:3, myVector)
# integer(0)
f还有另一种方法:
myVector <- c(3, NA, 1, 2, 4, 1, 1, 2)
matchSequence <- function(seq,vec) {
n.vec <- length(vec)
n.seq <- length(seq)
which(sapply(1:(n.vec-n.seq+1),function(i)all(head(vec[i:n.vec],n.seq)==seq)))
}
matchSequence(1:2,myVector)
# [1] 3 7
matchSequence(c(4,1,1),myVector)
# [1] 5
matchSequence(1:3,myVector)
# integer(0)
myVector另一个想法:
match_seq2 <- function(s,v){
n = length(s)
nc = length(v)-n+1
which(
n == rowsum(
as.integer(v[ rep(0:(n-1), nc) + rep(1:nc, each=n) ] == s),
rep(seq(nc),each=n)
)
)
}
match_seq <- function(s, v) Filter(
function(i) all.equal( s, v[i + seq_along(s) - 1] ),
which( v == s[1] )
)
# examples:
my_vec <- c(3, NA, 1, 2, 4, 1, 1, 2)
match_seq(1:2, my_vec) # 3 7
match_seq(c(4,1,1), my_vec) # 5
match_seq(1:3, my_vec) # integer(0)
我使用的是all.equal
而不是idential
,因为OP希望整数1:2
匹配数值c(1,2)
。这种方法引入了另一种情况,即允许对my_vec
末尾以外的点进行匹配(索引时为NA
):
OP的基准
# variant on Josh's, suggested by OP:
f2 <- function(seq, vec) {
mm <- t(embed(vec, length(seq))) == rev(seq) ## relies on recycling of seq
which(colSums(mm)==length(seq))
}
my_check <- function(values) {
all(sapply(values[-1], function(x) identical(values[[1]], x)))
}
set.seed(100)
my_vec2 <- sample(c(NA, 1:4), size = 1000, replace = TRUE)
s <- c(4,1,1)
microbenchmark(
op = matchSequence(s, my_vec2),
josh = f(s, my_vec2),
josh2 = f2(s, my_vec2),
frank = match_seq(s, my_vec2),
frank2 = match_seq2(s, my_vec2),
jlh = matchSequence2(s, my_vec2),
tlm = flm(s, my_vec2),
alexis = find_pat(s, my_vec2),
unit = "relative", check=my_check)
亚历克西斯·拉兹赢了
(请随时更新。请参阅alexis的答案以获取更多基准。)我认为另一次尝试更快。这要归功于它的速度,因为它只检查向量中与搜索序列开头匹配的点的匹配
flm <- function(sq, vec) {
hits <- which(sq[1]==vec)
out <- hits[
colSums(outer(0:(length(sq)-1), hits, function(x,y) vec[x+y]) == sq)==length(sq)
]
out[!is.na(out)]
}
和一个递归的想法(在2016年2月5日编辑,在模式中使用NA
s):
在基准上:
all.equal(matchSequence(s, my_vec2), find_pat(s, my_vec2))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(s, my_vec2),
flm(s, my_vec2),
find_pat(s, my_vec2),
unit = "relative")
#Unit: relative
# expr min lq median uq max neval
# matchSequence(s, my_vec2) 2.970888 3.096573 3.068802 3.023167 12.41387 100
# flm(s, my_vec2) 1.140777 1.173043 1.258394 1.280753 12.79848 100
# find_pat(s, my_vec2) 1.000000 1.000000 1.000000 1.000000 1.00000 100
使用较大的数据:
set.seed(911); VEC = sample(c(NA, 1:3), 1e6, TRUE); PAT = c(3, 2, 2, 1, 3, 2, 2, 1, 1, 3)
all.equal(matchSequence(PAT, VEC), find_pat(PAT, VEC))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(PAT, VEC),
flm(PAT, VEC),
find_pat(PAT, VEC),
unit = "relative", times = 20)
#Unit: relative
# expr min lq median uq max neval
# matchSequence(PAT, VEC) 23.106862 20.54601 19.831344 18.677528 12.563634 20
# flm(PAT, VEC) 2.810611 2.51955 2.963352 2.877195 1.728512 20
# find_pat(PAT, VEC) 1.000000 1.00000 1.000000 1.000000 1.000000 20
我测试了答案。你的比乔希的快5倍;比我的和howard的例子快50倍。非常好地使用了embed()
!正如@Frank所说的,这要慢一些,但这只是因为apply(,all)。虽然哪个(应用(mm,2,all))
更优雅,但速度要慢得多。将该行更改为which(colSums(matches)=length(vec))
,它会快10倍。很好!我的过滤器方法也仅从这些点进行检查,但由于矢量化,你的方法可能要快得多。棒极了!基于性能和节约,我接受了@thelatemail的回答,但非常感谢您的分析和比较。谢谢大家。使用类似于'%==%'@rawr:你说得对;除非我忽略了什么,否则当NA
s处于“pat”状态时,我添加了一个更简单的解决方法。
flm <- function(sq, vec) {
hits <- which(sq[1]==vec)
out <- hits[
colSums(outer(0:(length(sq)-1), hits, function(x,y) vec[x+y]) == sq)==length(sq)
]
out[!is.na(out)]
}
#Unit: relative
# expr min lq mean median uq max neval
# josh2 2.469769 2.393794 2.181521 2.353438 2.345911 1.51641 100
# lm 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000 100
find_pat = function(pat, x)
{
ff = function(.pat, .x, acc = if(length(.pat)) seq_along(.x) else integer(0L)) {
if(!length(.pat)) return(acc)
if(is.na(.pat[[1L]]))
Recall(.pat[-1L], .x, acc[which(is.na(.x[acc]))] + 1L)
else
Recall(.pat[-1L], .x, acc[which(.pat[[1L]] == .x[acc])] + 1L)
}
return(ff(pat, x) - length(pat))
}
find_pat(1:2, myVector)
#[1] 3 7
find_pat(c(4, 1, 1), myVector)
#[1] 5
find_pat(1:3, myVector)
#integer(0)
find_pat(c(NA, 1), myVector)
#[1] 2
find_pat(c(3, NA), myVector)
#[1] 1
all.equal(matchSequence(s, my_vec2), find_pat(s, my_vec2))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(s, my_vec2),
flm(s, my_vec2),
find_pat(s, my_vec2),
unit = "relative")
#Unit: relative
# expr min lq median uq max neval
# matchSequence(s, my_vec2) 2.970888 3.096573 3.068802 3.023167 12.41387 100
# flm(s, my_vec2) 1.140777 1.173043 1.258394 1.280753 12.79848 100
# find_pat(s, my_vec2) 1.000000 1.000000 1.000000 1.000000 1.00000 100
set.seed(911); VEC = sample(c(NA, 1:3), 1e6, TRUE); PAT = c(3, 2, 2, 1, 3, 2, 2, 1, 1, 3)
all.equal(matchSequence(PAT, VEC), find_pat(PAT, VEC))
#[1] TRUE
microbenchmark::microbenchmark(matchSequence(PAT, VEC),
flm(PAT, VEC),
find_pat(PAT, VEC),
unit = "relative", times = 20)
#Unit: relative
# expr min lq median uq max neval
# matchSequence(PAT, VEC) 23.106862 20.54601 19.831344 18.677528 12.563634 20
# flm(PAT, VEC) 2.810611 2.51955 2.963352 2.877195 1.728512 20
# find_pat(PAT, VEC) 1.000000 1.00000 1.000000 1.000000 1.000000 20