Python: finding the longest adjacent, repeating, non-overlapping substring
(This question has nothing to do with music, but I'm using music as an example of a use case.)
In music, a common way to structure a phrase is as a sequence of notes in which the middle part is repeated one or more times. The phrase therefore consists of an intro, a looping part, and an outro. Here is an example:
[ E E E F G A F F G A F F G A F C D ]
We can "see" that the intro is [ E E E ], the repeating part is [ F G A F ], and the outro is [ C D ]. So the way to split the list would be
[ [ E E E ] 3 [ F G A F ] [ C D ] ]
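where the first item is the intro, the second is the number of times the looping part repeats, the third is the looping part itself, and the fourth is the outro.
I need an algorithm that performs such a split.
There is one caveat, though: there may be several ways to split a list. For example, the list above could also be split as: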
[ [ E E E F G A ] 2 [ F F G A ] [ F C D ] ]
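But that is a worse split, because the intro and outro are longer. So the criterion for the algorithm is to find the split that maximizes the length of the looping part and minimizes the combined length of the intro and outro. That means the correct split for
[ A C C C C C C C C C A ]
is
[ [ A ] 9 [ C ] [ A ] ]
because the combined length of the intro and outro is 2 and the length of the looping part is 9.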
Also, while the intro and outro can be empty, only "true" repeats are allowed. So the following split is disallowed:
[ [ ] 1 [ E E E F G A F F G A F F G A F C D ] [ ] ]
Think of it as finding the optimal "compression" for the sequence. Note that some sequences may not contain any repeats at all:
[ A B C D ]
For these degenerate cases, any sensible result is allowed.
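The rules above can be written down as a few small checks. This is a minimal sketch, not part of the original question: the helper names `expand`, `is_valid` and `leftover` are hypothetical, and notes are written as single characters in a string so that the splits are easy to compare.

```python
def expand(split):
    """Rebuild the full sequence from an (intro, reps, loop, outro) split."""
    intro, reps, loop, outro = split
    return intro + loop * reps + outro


def is_valid(split, seq):
    """A split must reproduce seq, and the loop must truly repeat (reps >= 2)."""
    intro, reps, loop, outro = split
    return reps >= 2 and len(loop) > 0 and expand(split) == seq


def leftover(split):
    """Combined length of intro and outro; smaller is better."""
    intro, _, _, outro = split
    return len(intro) + len(outro)


seq = "EEEFGAFFGAFFGAFCD"
good = ("EEE", 3, "FGAF", "CD")          # preferred split
worse = ("EEEFGA", 2, "FFGA", "FCD")     # valid, but longer intro and outro
trivial = ("", 1, seq, "")               # disallowed: no true repeat

for split in (good, worse, trivial):
    print(split, is_valid(split, seq), leftover(split))
```

Both `good` and `worse` rebuild the same sequence, so the comparison comes down to how much of the sequence is left outside the loop.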
Here is my implementation of the algorithm:
def find_longest_repeating_non_overlapping_subseq(seq):
    candidates = []
    for i in range(len(seq)):
        candidate_max = len(seq[i + 1:]) // 2
        for j in range(1, candidate_max + 1):
            candidate, remaining = seq[i:i + j], seq[i + j:]
            n_reps = 1
            len_candidate = len(candidate)
            while remaining[:len_candidate] == candidate:
                n_reps += 1
                remaining = remaining[len_candidate:]
            if n_reps > 1:
                candidates.append((seq[:i], n_reps,
                                   candidate, remaining))
    if not candidates:
        return (type(seq)(), 1, seq, type(seq)())

    def score_candidate(candidate):
        intro, reps, loop, outro = candidate
        return reps - len(intro) - len(outro)

    return sorted(candidates, key=score_candidate)[-1]
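I'm not sure whether it is correct, but it passes the simple tests I have described. The problem is that it is far too slow. I have looked at suffix trees, but they don't seem to fit my use case, because the substrings I'm after have to be non-overlapping and adjacent.

It looks like what you are trying to do is pretty much a compression algorithm. You can check your code against the reference implementation in the Wikipedia article I linked to.

Here is my implementation of what you're describing. It is very similar to yours, but it skips substrings that have already been checked as repetitions of an earlier substring.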
from collections import namedtuple

SubSequence = namedtuple('SubSequence', ['start', 'length', 'reps'])

def longest_repeating_subseq(original: str):
    winner = SubSequence(start=0, length=0, reps=0)
    checked = set()
    subsequences = (  # Evaluates lazily during iteration
        SubSequence(start=start, length=length, reps=1)
        for start in range(len(original))
        for length in range(1, len(original) - start)
        if (start, length) not in checked)

    for s in subsequences:
        subseq = original[s.start : s.start + s.length]
        for reps, next_start in enumerate(
                range(s.start + s.length, len(original), s.length),
                start=1):
            if subseq != original[next_start : next_start + s.length]:
                break
            else:
                checked.add((next_start, s.length))

        s = s._replace(reps=reps)
        if s.reps > 1 and (
                (s.length * s.reps > winner.length * winner.reps)
                or (  # When total lengths are equal, prefer the shorter substring
                    s.length * s.reps == winner.length * winner.reps
                    and s.reps > winner.reps)):
            winner = s

    # Check for default case with no repetitions
    if winner.reps == 0:
        winner = SubSequence(start=0, length=len(original), reps=1)

    return (
        original[ : winner.start],
        winner.reps,
        original[winner.start : winner.start + winner.length],
        original[winner.start + winner.length * winner.reps : ])

def test(seq, *, expect):
    print(f'Testing longest_repeating_subseq for {seq}')
    result = longest_repeating_subseq(seq)
    print(f'Expected {expect}, got {result}')
    print(f'Test {"passed" if result == expect else "failed"}')
    print()

if __name__ == '__main__':
    test('EEEFGAFFGAFFGAFCD', expect=('EEE', 3, 'FGAF', 'CD'))
    test('ACCCCCCCCCA', expect=('A', 9, 'C', 'A'))
    test('ABCD', expect=('', 1, 'ABCD', ''))
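It passes all three of your examples. This seems like the kind of thing that could have a lot of odd edge cases, but given that it is an optimized brute force, handling them would probably be more a matter of updating the spec than of fixing a bug in the code itself.

Here is an approach that is clearly quadratic-time, but with a relatively low constant factor, because it builds no substring objects other than those of length 1. The result is a 2-tuple,
bestlen, list_of_results
where bestlen is the length of the longest substring made of repeated adjacent blocks, and each result is a 3-tuple,
start_index, width, numreps
meaning that the substring being repeated is
the_string[start_index : start_index + width]
and that there are numreps adjacent copies of it. It is always the case that
bestlen == width * numreps
The problem description leaves some ambiguity. For example, consider this output: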
>>> crunch2("aaaaaabababa")
(6, [(0, 1, 6), (0, 2, 3), (5, 2, 3), (6, 2, 3), (0, 3, 2)])
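So it found 5 ways to view the "longest" stretch as having length 6:
- The initial "a" repeated 6 times.
- The initial "aa" repeated 3 times.
- The leftmost instance of "ab" repeated 3 times.
- The leftmost instance of "ba" repeated 3 times.
- The initial "aaa" repeated 2 times.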
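The intro and outro are not returned, because they follow directly from each result:
- The intro is the_string[:start_index].
- The outro is the_string[start_index + bestlen:].
If there are no repeated adjacent blocks, the result is
(0, [])
The other examples from your post work the same way.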
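Turning one result tuple into the split format used in the question takes only a few slices. The following is a minimal sketch: the helper name `split_from_result` is hypothetical, and the result tuples below are written by hand to match the splits given in the question rather than copied from program output.

```python
def split_from_result(the_string, result):
    """Convert (start_index, width, numreps) into (intro, reps, loop, outro)."""
    start_index, width, numreps = result
    bestlen = width * numreps                      # total length of the repeated stretch
    intro = the_string[:start_index]
    loop = the_string[start_index : start_index + width]
    outro = the_string[start_index + bestlen :]
    return intro, numreps, loop, outro


# Hand-written result tuples for two sequences from the question.
print(split_from_result("EEEFGAFFGAFFGAFCD", (3, 4, 3)))
print(split_from_result("ACCCCCCCCCA", (1, 1, 9)))
```

When several results tie on bestlen, each of them yields a different but equally long split, so the caller still has to pick one.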
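The key to how this works: suppose there are N adjacent repeated blocks, each of width W. Then consider what happens when you compare the original string to the same string shifted left by W: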
... block1 block2 ... blockN-1 blockN ...
... block2 block3 ... blockN ... ...
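Then you get (N-1)*W consecutive equal characters at the same positions. But this also works in the other direction: if you shift left by W and find (N-1)*W consecutive equal characters, then you can deduce: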
block1 == block2
block2 == block3
...
blockN-1 == blockN
So all N blocks must be repetitions of block1.
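So the code repeatedly shifts (a copy of) the original string left by one character, then marches left to right over both strings to identify the longest stretches of equal characters. That only requires comparing one pair of characters at a time. To make the "shift left" efficient (constant time), the copy of the string is stored in a collections.deque.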
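The shift-and-compare idea can be seen in isolation for a single width. This is a minimal sketch, not the answer's code: the function name `equal_runs` is hypothetical, and it only reports the runs of equal characters for one shift, leaving the search over all widths out.

```python
from collections import deque


def equal_runs(s, width):
    """Return (start, length) for each maximal run of i with s[i] == s[i + width]."""
    t = deque(s)
    for _ in range(width):
        t.popleft()                  # constant-time "shift left by one"
    runs = []
    start = length = 0
    for i, (a, b) in enumerate(zip(s, t)):
        if a == b:
            if not length:           # a new run starts here
                start = i
            length += 1
        elif length:                 # the current run just ended
            runs.append((start, length))
            length = 0
    if length:
        runs.append((start, length))
    return runs


s = "EEEFGAFFGAFFGAFCD"
width = 4
for start, length in equal_runs(s, width):
    numreps = 1 + length // width    # a run of (N-1)*W equal characters means N blocks
    print(start, length, numreps, s[start : start + width])
```

For this string and width 4 there is a single run of 8 equal characters starting at index 3, which corresponds to 3 adjacent copies of the 4-character block.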
EDIT: update() did far too much useless work in many cases; I replaced it.
def crunch2(s):
    from collections import deque

    # There are zcount equal characters starting
    # at index starti.
    def update(starti, zcount):
        nonlocal bestlen
        while zcount >= width:
            numreps = 1 + zcount // width
            count = width * numreps
            if count >= bestlen:
                if count > bestlen:
                    results.clear()
                results.append((starti, width, numreps))
                bestlen = count
            else:
                break
            zcount -= 1
            starti += 1

    bestlen, results = 0, []
    t = deque(s)
    for width in range(1, len(s) // 2 + 1):
        t.popleft()
        zcount = 0
        for i, (a, b) in enumerate(zip(s, t)):
            if a == b:
                if not zcount:  # new run starts here
                    starti = i
                zcount += 1
            # else a != b, so equal run (if any) ended
            elif zcount:
                update(starti, zcount)
                zcount = 0
        if zcount:
            update(starti, zcount)
    return bestlen, results
Using regexps
[Removed due to size limit]
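Since the regexp code itself is not included, here is a rough sketch of what a regexp-based search could look like. It is an assumption about the general idea, not the removed implementation: the function name `crunch_regex` and the pattern are my own, and it makes no attempt to be fast. It returns the same (bestlen, results) shape as the other functions.

```python
import re


def crunch_regex(s):
    bestlen, results = 0, []
    for width in range(1, len(s) // 2 + 1):
        # The lookahead lets a match start at every index. Group 2 is one
        # block of `width` characters; group 1 is the whole repeated stretch.
        # DOTALL makes "." match newlines too.
        pattern = re.compile(r"(?=((.{%d})\2+))" % width, re.DOTALL)
        for m in pattern.finditer(s):
            total = len(m.group(1))
            if total >= bestlen:
                if total > bestlen:
                    results.clear()
                    bestlen = total
                results.append((m.start(), width, total // width))
    return bestlen, results


print(crunch_regex("EEEFGAFFGAFFGAFCD"))
print(crunch_regex("ACCCCCCCCCA"))
print(crunch_regex("ABCD"))
```

Without re.DOTALL, "." does not match a newline, so repeated blocks that span a line break would be missed.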
Using a suffix array
This is the fastest I have found so far, although it can still be provoked into quadratic-time behavior.
Note that it doesn't matter whether overlapping strings are found. As explained for the crunch2() program above (elaborated on here in minor ways):
- Given a string s with length n = len(s).
- Given ints i and j with 0 <= i < j < n.
Then, if w = j - i and c is the number of leading characters that s[i:] and s[j:] have in common, the block s[i:j] (of length w) is repeated, starting at s[i], a total of 1 + c // w times.
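The rule used throughout the suffix-array versions can be checked directly on a pair of indices: for i < j, with w = j - i and c the length of the common prefix of s[i:] and s[j:], the block s[i:j] occurs 1 + c // w times in a row starting at i. A minimal sketch, with a hypothetical helper name and a plain character-by-character prefix count instead of a suffix array:

```python
def reps_from_pair(s, i, j):
    """Number of adjacent copies of s[i:j] starting at index i (requires i < j)."""
    w = j - i
    c = 0
    # c = length of the common prefix of s[i:] and s[j:]
    while j + c < len(s) and s[i + c] == s[j + c]:
        c += 1
    return 1 + c // w


print(reps_from_pair("EEEFGAFFGAFFGAFCD", 3, 7))   # block "FGAF"
print(reps_from_pair("bcdabcdbcd", 4, 7))          # block "bcd" near the end
print(reps_from_pair("ABCD", 0, 1))                # no repeat at all
```

A suffix array with an LCP array supplies the same c values without rescanning the string for every pair.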
def crunch4(s):
    from sa import suffix_array
    sa, rank, lcp = suffix_array(s)
    bestlen, results = 0, []
    n = len(s)
    for sai in range(n-1):
        i = sa[sai]
        c = n
        for saj in range(sai + 1, n):
            c = min(c, lcp[saj])
            if not c:
                break
            j = sa[saj]
            w = abs(i - j)
            if c < w:
                continue
            numreps = 1 + c // w
            assert numreps > 1
            total = w * numreps
            if total >= bestlen:
                if total > bestlen:
                    results.clear()
                    bestlen = total
                results.append((min(i, j), w, numreps))
    return bestlen, results
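The `sa` module imported by these functions is not shown here. To experiment with them on small inputs, a naive stand-in can be used. The sketch below is an assumption inferred from how the three return values are used in the code: sa is the list of suffix start indices in sorted order, rank is its inverse, and lcp[k] is the length of the common prefix of the suffixes at sa[k-1] and sa[k], with lcp[0] set to 0. It is far slower than a real suffix-array construction.

```python
def suffix_array(text):
    """Naive stand-in for sa.suffix_array (hypothetical; small inputs only)."""
    n = len(text)
    # Sort suffix start positions by the suffix they begin.
    sa = sorted(range(n), key=lambda i: text[i:])
    # rank is the inverse permutation: rank[sa[r]] == r.
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # lcp[r] = common prefix length of the suffixes at sa[r-1] and sa[r].
    lcp = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        lcp[r] = k
    return sa, rank, lcp


sa, rank, lcp = suffix_array("banana")
print(sa, rank, lcp)
```

Saving this in a file named sa.py next to the crunch functions would satisfy their `from sa import suffix_array` line, assuming the conventions above match the original module.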
>>> len(xs)
209755
>>> xs.count('\n')
25481

>>> crunch2(xs)
(44, [(63750, 22, 2)])
>>> xs[63750 : 63750+50]
'\nelectroencephalograph\nelectroencephalography\nelec'

>>> crunch3(xs)
(8, [(19308, 4, 2), (47240, 4, 2)])
>>> xs[19308 : 19308+10]
'beriberi\nB'
>>> xs[47240 : 47240+10]
'couscous\nc'

>>> crunch3(xs) # with DOTALL
(44, [(63750, 22, 2)])

>>> crunch4(xs)
(44, [(63750, 22, 2)])
'x' * 1000000
def crunch5(text):
    from collections import namedtuple, defaultdict

    # For all integers i and j in IxSet x.s,
    # text[i : i + x.w] == text[j : j + x.w].
    # That is, it's the set of all indices at which a specific
    # substring of length x.w is found.
    # In general, we only care about repeated substrings here,
    # so weed out those that would otherwise have len(x.s) == 1.
    IxSet = namedtuple("IxSet", "s w")

    bestlen, results = 0, []

    # Compute sets of indices for repeated (not necessarily
    # adjacent!) substrings of length xs[0].w + ys[0].w, by looking
    # at the cross product of the index sets in xs and ys.
    def combine(xs, ys):
        xw, yw = xs[0].w, ys[0].w
        neww = xw + yw
        result = []
        for y in ys:
            shifted = set(i - xw for i in y.s if i >= xw)
            for x in xs:
                ok = shifted & x.s
                if len(ok) > 1:
                    result.append(IxSet(ok, neww))
        return result

    # Check an index set for _adjacent_ repeated substrings.
    def check(s):
        nonlocal bestlen
        x, w = s.s.copy(), s.w
        while x:
            current = start = x.pop()
            count = 1
            while current + w in x:
                count += 1
                current += w
                x.remove(current)
            while start - w in x:
                count += 1
                start -= w
                x.remove(start)
            if count > 1:
                total = count * w
                if total >= bestlen:
                    if total > bestlen:
                        results.clear()
                        bestlen = total
                    results.append((start, w, count))

    ch2ixs = defaultdict(set)
    for i, ch in enumerate(text):
        ch2ixs[ch].add(i)
    size1 = [IxSet(s, 1)
             for s in ch2ixs.values()
             if len(s) > 1]
    del ch2ixs
    for x in size1:
        check(x)

    current_class = size1
    # Repeatedly increase size by 1 until current_class becomes
    # empty. At that point, there are no repeated substrings at all
    # (adjacent or not) of the then-current size (or larger).
    while current_class:
        current_class = combine(current_class, size1)
        for x in current_class:
            check(x)

    return bestlen, results
def crunch6(text):
    from sa import suffix_array
    sa, rank, lcp = suffix_array(text)
    bestlen, results = 0, []
    n = len(text)

    # Generate maximal sets of indices s such that for all i and j
    # in s the suffixes starting at s[i] and s[j] start with a
    # common prefix of at least len minc.
    def genixs(minc, sa=sa, lcp=lcp, n=n):
        i = 1
        while i < n:
            c = lcp[i]
            if c < minc:
                i += 1
                continue
            ixs = {sa[i-1], sa[i]}
            i += 1
            while i < n:
                c = min(c, lcp[i])
                if c < minc:
                    yield ixs
                    i += 1
                    break
                else:
                    ixs.add(sa[i])
                    i += 1
            else:  # ran off the end of lcp
                yield ixs

    # Check an index set for _adjacent_ repeated substrings
    # w apart. CAUTION: this empties s.
    def check(s, w):
        nonlocal bestlen
        while s:
            current = start = s.pop()
            count = 1
            while current + w in s:
                count += 1
                current += w
                s.remove(current)
            while start - w in s:
                count += 1
                start -= w
                s.remove(start)
            if count > 1:
                total = count * w
                if total >= bestlen:
                    if total > bestlen:
                        results.clear()
                        bestlen = total
                    results.append((start, w, count))

    c = 0
    found = True
    while found:
        c += 1
        found = False
        for s in genixs(c):
            found = True
            check(s, c)
    return bestlen, results
>>> x = "bcdabcdbcd"
>>> crunch4(x)  # finds repeated bcd at end
(6, [(4, 3, 2)])
>>> crunch4a(x) # finds nothing
(0, [])

bcd
bcdabcdbcd
bcdbcd
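The three suffixes listed above appear to be the ones that start with "bcd", in sorted order; that reading is my assumption, since the surrounding explanation is missing. A small sketch makes the problem for an adjacent-only scan visible: the two suffixes that form the repeat (at indices 4 and 7) are separated in the sorted order by the suffix at index 0.

```python
x = "bcdabcdbcd"

# Suffix start positions in sorted order (naive construction, fine for a demo).
sa = sorted(range(len(x)), key=lambda i: x[i:])

# Keep only the suffixes that begin with "bcd", preserving sorted order.
bcd_starts = [i for i in sa if x.startswith("bcd", i)]
for i in bcd_starts:
    print(i, x[i:])

# Position of each suffix within the sorted order.
pos = {i: r for r, i in enumerate(sa)}
print(pos[7], pos[0], pos[4])
```

Because the suffixes at 7 and 4 are not neighbours in the suffix array, a scan that only compares adjacent entries never pairs them, while a scan over all later entries does.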
# only look at adjacent entries - fast, but sometimes wrong
def crunch4a(s):
    from sa import suffix_array
    sa, rank, lcp = suffix_array(s)
    bestlen, results = 0, []
    n = len(s)
    for sai in range(1, n):
        i, j = sa[sai - 1], sa[sai]
        c = lcp[sai]
        w = abs(i - j)
        if c >= w:
            numreps = 1 + c // w
            total = w * numreps
            if total >= bestlen:
                if total > bestlen:
                    results.clear()
                    bestlen = total
                results.append((min(i, j), w, numreps))
    return bestlen, results
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
    lcp.append(0)
    stack = [(0, 0)]
    for i in range(1, len(lcp)):
        c = lcp[i]
        lb = i - 1
        while c < stack[-1][0]:
            i_c, lb = stack.pop()
            interval = i_c, lb, i - 1
            yield interval
        if c > stack[-1][0]:
            stack.append((c, lb))
    lcp.pop()

def crunch9(text):
    from sa import suffix_array

    sa, rank, lcp = suffix_array(text)
    bestlen, results = 0, []
    n = len(text)

    # generate branching tandem repeats
    def gen_btr(text=text, n=n, sa=sa):
        for c, lb, rb in genlcpi(lcp):
            i = sa[lb]
            basic = text[i : i + c]
            # Binary searches to find subrange beginning with
            # basic+basic. A more gonzo implementation would do this
            # character by character, never materializing the common
            # prefix in `basic`.
            rb += 1
            hi = rb
            while lb < hi:  # like bisect.bisect_left
                mid = (lb + hi) // 2
                i = sa[mid] + c
                if text[i : i + c] < basic:
                    lb = mid + 1
                else:
                    hi = mid
            lo = lb
            while lo < rb:  # like bisect.bisect_right
                mid = (lo + rb) // 2
                i = sa[mid] + c
                if basic < text[i : i + c]:
                    rb = mid
                else:
                    lo = mid + 1
            lead = basic[0]
            for sai in range(lb, rb):
                i = sa[sai]
                j = i + 2*c
                assert j <= n
                if j < n and text[j] == lead:
                    continue  # it's non-branching
                yield (i, c, 2)

    for start, c, _ in gen_btr():
        # extend left
        numreps = 2
        for i in range(start - c, -1, -c):
            if all(text[i+k] == text[start+k] for k in range(c)):
                start = i
                numreps += 1
            else:
                break
        totallen = c * numreps
        if totallen < bestlen:
            continue
        if totallen > bestlen:
            bestlen = totallen
            results.clear()
        results.append((start, c, numreps))
        # add non-branches
        while start:
            if text[start - 1] == text[start + c - 1]:
                start -= 1
                results.append((start, c, numreps))
            else:
                break
    return bestlen, results
assert all(rank[sa[i]] == sa[rank[i]] == i for i in range(len(sa)))
# Generate lcp intervals from the lcp array.
def genlcpi(lcp):
    lcp.append(0)
    stack = [(0, 0)]
    for i in range(1, len(lcp)):
        c = lcp[i]
        lb = i - 1
        while c < stack[-1][0]:
            i_c, lb = stack.pop()
            yield (i_c, lb, i)
        if c > stack[-1][0]:
            stack.append((c, lb))
    lcp.pop()

def crunch11(text):
    from sa import suffix_array

    sa, rank, lcp = suffix_array(text)
    bestlen, results = 0, []
    n = len(text)

    # Generate branching tandem repeats.
    # (i, c, 2) is branching tandem iff
    #     i+c in interval with prefix text[i : i+c], and
    #     i+c not in subinterval with prefix text[i : i+c + 1]
    # Caution: this pragmatically relies on that, in Python 3,
    # `range()` returns a tiny object with O(1) membership testing.
    # In Python 2 it returns a list - should still work, but very
    # much slower.
    def gen_btr(text=text, n=n, sa=sa, rank=rank):
        from itertools import chain

        for c, lb, rb in genlcpi(lcp):
            origlb, origrb = lb, rb
            origrange = range(lb, rb)
            i = sa[lb]
            lead = text[i]
            # Binary searches to find subrange beginning with
            # text[i : i+c+1]. Note we take slices of length 1
            # rather than just index to avoid special-casing for
            # i >= n.
            # A more elaborate traversal of the lcp array could also
            # give us a list of child intervals, and then we'd just
            # need to pick the right one. But that would be even
            # more hairy code, and unclear to me it would actually
            # help the worst cases (yes, the interval can be large,
            # but so can a list of child intervals).
            hi = rb
            while lb < hi:  # like bisect.bisect_left
                mid = (lb + hi) // 2
                i = sa[mid] + c
                if text[i : i+1] < lead:
                    lb = mid + 1
                else:
                    hi = mid
            lo = lb
            while lo < rb:  # like bisect.bisect_right
                mid = (lo + rb) // 2
                i = sa[mid] + c
                if lead < text[i : i+1]:
                    rb = mid
                else:
                    lo = mid + 1
            subrange = range(lb, rb)
            if 2 * len(subrange) <= len(origrange):
                # Subrange is at most half the size.
                # Iterate over it to find candidates i, starting
                # with wa. If i+c is also in origrange, but not
                # in subrange, good: then i is of the form wwx.
                for sai in subrange:
                    i = sa[sai]
                    ic = i + c
                    if ic < n:
                        r = rank[ic]
                        if r in origrange and r not in subrange:
                            yield (i, c, 2, subrange)
            else:
                # Iterate over the parts outside subrange instead.
                # Candidates i are then the trailing wx in the
                # hoped-for wwx. We win if i-c is in subrange too
                # (or, for that matter, if it's in origrange).
                for sai in chain(range(origlb, lb),
                                 range(rb, origrb)):
                    ic = sa[sai] - c
                    if ic >= 0 and rank[ic] in subrange:
                        yield (ic, c, 2, subrange)

    for start, c, numreps, irange in gen_btr():
        # extend left
        crange = range(start - c, -1, -c)
        if (numreps + len(crange)) * c < bestlen:
            continue
        for i in crange:
            if rank[i] in irange:
                start = i
                numreps += 1
            else:
                break
        # check for best
        totallen = c * numreps
        if totallen < bestlen:
            continue
        if totallen > bestlen:
            bestlen = totallen
            results.clear()
        results.append((start, c, numreps))
        # add non-branches
        while start and text[start - 1] == text[start + c - 1]:
            start -= 1
            results.append((start, c, numreps))
    return bestlen, results