Python: finding the longest adjacent repeating non-overlapping substring


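(This question is not about music, but I'm using music as an example use case.)

In music, a common way to structure a phrase is as a sequence of notes in which the middle part is repeated one or more times. The phrase thus consists of an intro, a looping part, and an outro. Here is an example:

[ E E E F G A F F G A F F G A F C D ]

We can "see" that the intro is [ E E E ], the repeating part is [ F G A F ], and the outro is [ C D ]. So the way to split the list would be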

[ [ E E E ] 3 [ F G A F ] [ C D ] ]
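where the first item is the intro, the second is the number of times the looping part repeats, the third is the looping part, and the fourth is the outro.

I need an algorithm that performs such a split.

There is one caveat: there may be multiple ways to split the list. For example, the list above could also be split into: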

[ [ E E E F G A ] 2 [ F F G A ] [ F C D ] ]
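But this is a worse split because the intro and outro are longer. So the criterion for the algorithm is to find the split that maximizes the length of the looping part and minimizes the combined length of the intro and outro. That means the correct split for

[ A C C C C C C C C C A ]

is

[ [ A ] 9 [ C ] [ A ] ]

because the combined length of the intro and outro is 2 and the length of the looping part is 9.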

Also, while the intro and outro can be empty, only "true" repeats are allowed. So the following split is not allowed:

[ [ ] 1 [ E E E F G A F F G A F F G A F C D ] [ ] ]
Think of it as finding the optimal "compression" of the sequence. Note that some sequences may contain no repeats at all:

[ A B C D ]
For these degenerate cases, any sensible result is allowed.
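To make the split format concrete, here is a minimal sketch, assuming a split is the 4-item structure (intro, reps, loop, outro) shown above. The helper names `expand_split` and `overhead` are hypothetical and exist only for illustration: the first rebuilds the sequence from a split, the second measures the combined intro and outro length used to compare candidates.

```python
def expand_split(split):
    """Rebuild the full sequence from (intro, reps, loop, outro)."""
    intro, reps, loop, outro = split
    return intro + loop * reps + outro


def overhead(split):
    """Combined length of intro and outro; smaller is better."""
    intro, _reps, _loop, outro = split
    return len(intro) + len(outro)


seq = list("EEEFGAFFGAFFGAFCD")
good = (list("EEE"), 3, list("FGAF"), list("CD"))
worse = (list("EEEFGA"), 2, list("FFGA"), list("FCD"))

# Both splits describe the same sequence ...
assert expand_split(good) == seq
assert expand_split(worse) == seq
# ... but the first one leaves fewer notes outside the loop.
print(overhead(good), overhead(worse))  # prints: 5 9
```

Because the total length is fixed, minimizing the overhead is the same as maximizing the total length covered by the repeated loop.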

Here is my implementation of the algorithm:

def find_longest_repeating_non_overlapping_subseq(seq):
    candidates = []
    for i in range(len(seq)):
        candidate_max = len(seq[i + 1:]) // 2
        for j in range(1, candidate_max + 1):
            candidate, remaining = seq[i:i + j], seq[i + j:]
            n_reps = 1
            len_candidate = len(candidate)
            while remaining[:len_candidate] == candidate:
                n_reps += 1
                remaining = remaining[len_candidate:]
            if n_reps > 1:
                candidates.append((seq[:i], n_reps,
                                   candidate, remaining))
    if not candidates:
        return (type(seq)(), 1, seq, type(seq)())

    def score_candidate(candidate):
        intro, reps, loop, outro = candidate
        return reps - len(intro) - len(outro)
    return sorted(candidates, key = score_candidate)[-1]
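I'm not sure it is correct, but it passes the simple tests I've described. The problem is that it is far too slow. I've looked at suffix trees, but they don't seem to fit my use case because the substrings I'm after must be non-overlapping and adjacent.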

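It looks like what you're trying to do is almost exactly a compression algorithm. You can check your code against the reference implementation in the Wikipedia article I linked to.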

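Here is my implementation of what you described. It is very similar to yours, but it skips substrings that have already been checked as repetitions of an earlier substring.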

from collections import namedtuple
SubSequence = namedtuple('SubSequence', ['start', 'length', 'reps'])

def longest_repeating_subseq(original: str):
    winner = SubSequence(start=0, length=0, reps=0)
    checked = set()
    subsequences = (  # Evaluates lazily during iteration
        SubSequence(start=start, length=length, reps=1)
        for start in range(len(original))
        for length in range(1, len(original) - start)
        if (start, length) not in checked)

    for s in subsequences:
        subseq = original[s.start : s.start + s.length]
        for reps, next_start in enumerate(
                range(s.start + s.length, len(original), s.length),
                start=1):
            if subseq != original[next_start : next_start + s.length]:
                break
            else:
                checked.add((next_start, s.length))

        s = s._replace(reps=reps)
        if s.reps > 1 and (
                (s.length * s.reps > winner.length * winner.reps)
                or (  # When total lengths are equal, prefer the shorter substring
                    s.length * s.reps == winner.length * winner.reps
                    and s.reps > winner.reps)):
            winner = s

    # Check for default case with no repetitions
    if winner.reps == 0:
        winner = SubSequence(start=0, length=len(original), reps=1)

    return (
        original[ : winner.start],
        winner.reps,
        original[winner.start : winner.start + winner.length],
        original[winner.start + winner.length * winner.reps : ])

def test(seq, *, expect):
    print(f'Testing longest_repeating_subseq for {seq}')
    result = longest_repeating_subseq(seq)
    print(f'Expected {expect}, got {result}')
    print(f'Test {"passed" if result == expect else "failed"}')
    print()

if __name__ == '__main__':
    test('EEEFGAFFGAFFGAFCD', expect=('EEE', 3, 'FGAF', 'CD'))
    test('ACCCCCCCCCA', expect=('A', 9, 'C', 'A'))
    test('ABCD', expect=('', 1, 'ABCD', ''))

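It passes all three of your examples for me. This seems like the kind of thing that could have a lot of odd edge cases, but given that it's an optimized brute force, fixing them would more likely mean updating the spec than fixing bugs in the code itself.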

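Here is a way that is clearly quadratic-time, but with a relatively low constant factor because it builds no substring objects apart from those of length 1. The result is a 2-tuple

bestlen, list_of_results
where bestlen is the length of the longest substring of repeated adjacent blocks, and each result is a 3-tuple

start_index, width, numreps
meaning that the substring being repeated is

the_string[start_index : start_index + width]
and there are numreps of those adjacent. It is always the case that

bestlen == width * numreps
The problem description leaves ambiguities. For example, consider this output: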

>>> crunch2("aaaaaabababa")
(6, [(0, 1, 6), (0, 2, 3), (5, 2, 3), (6, 2, 3), (0, 3, 2)])
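So it found 5 ways to view "the longest" stretch as being of length 6:

  • The initial "a" repeated 6 times.
  • The initial "aa" repeated 3 times.
  • The leftmost instance of "ab" repeated 3 times.
  • The leftmost instance of "ba" repeated 3 times.
  • The initial "aaa" repeated 2 times.
It doesn't return the intro or outro, because those are trivial to deduce from what it does return: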

  • The intro is the_string[:start_index].
  • The outro is the_string[start_index + bestlen:].
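The two slices above can be wrapped in a small helper. This is only a sketch: the name `to_split` is hypothetical, and the result tuples passed in below are written by hand for illustration rather than taken from a program run.

```python
def to_split(the_string, bestlen, result):
    """Turn one (start_index, width, numreps) result into
    the (intro, reps, loop, outro) form used in the question."""
    start_index, width, numreps = result
    # The invariant stated above: bestlen == width * numreps.
    assert bestlen == width * numreps
    intro = the_string[:start_index]
    loop = the_string[start_index:start_index + width]
    outro = the_string[start_index + bestlen:]
    return intro, numreps, loop, outro


print(to_split("EEEFGAFFGAFFGAFCD", 12, (3, 4, 3)))
# prints: ('EEE', 3, 'FGAF', 'CD')
```

Concatenating intro, loop repeated numreps times, and outro gives back the original string, which is a cheap sanity check on any result.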
If there are no repeated adjacent blocks, it returns

(0, [])
Other examples (from your post):
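As a cross-check on the sequences from the post, here is a minimal brute-force sketch that returns the same (bestlen, list_of_results) shape. It is not the crunch2 code from this answer: the name `brute_force_crunch` is hypothetical, and it builds substring slices freely, so it is expected to be much slower on long inputs.

```python
def brute_force_crunch(s):
    """Return (bestlen, results) by trying every width and start."""
    bestlen, results = 0, []
    n = len(s)
    for width in range(1, n // 2 + 1):
        # A block needs room for at least one adjacent repetition.
        for start in range(n - 2 * width + 1):
            block = s[start:start + width]
            numreps = 1
            nxt = start + width
            # Count adjacent copies of the block.
            while s[nxt:nxt + width] == block:
                numreps += 1
                nxt += width
            total = width * numreps
            if numreps > 1 and total >= bestlen:
                if total > bestlen:
                    results.clear()
                    bestlen = total
                results.append((start, width, numreps))
    return bestlen, results


print(brute_force_crunch("EEEFGAFFGAFFGAFCD"))  # (12, [(3, 4, 3)])
print(brute_force_crunch("ACCCCCCCCCA"))        # (9, [(1, 1, 9), (1, 3, 3)])
print(brute_force_crunch("ABCD"))               # (0, [])
```

The second case shows the ambiguity again: nine "C" and three "CCC" both cover the same stretch of length 9.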

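The key to how it works: suppose you have adjacent repeated blocks, each of width W. Then consider what happens when you compare the original string to the string shifted left by W: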
... block1 block2 ... blockN-1 blockN ...
... block2 block3 ... blockN      ... ...
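Then you get (N-1)*W consecutive equal characters at the same positions. But that also works in the other direction: if you shift left by W and find (N-1)*W consecutive equal characters, then you can deduce: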

block1 == block2
block2 == block3
...
blockN-1 == blockN
So all N blocks must be repetitions of block1.

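So the code repeatedly shifts (a copy of) the original string left by one character, then marches left to right over both, identifying the longest stretches of equal characters. That only requires comparing a pair of characters at a time. To make "shift left" efficient (constant time), the copy of the string is stored in a collections.deque.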
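The shift-and-compare idea can be seen in isolation with a small sketch. The helper name `equal_runs` is hypothetical, and for brevity it shifts with a slice `s[w:]` instead of a deque, which costs a copy per width and is an assumption made only to keep the example short.

```python
def equal_runs(s, w):
    """Return (start, length) for each maximal run of positions i
    where s[i] == s[i + w], i.e. s compared with s shifted left by w."""
    runs = []
    start, length = 0, 0
    for i, (a, b) in enumerate(zip(s, s[w:])):
        if a == b:
            if length == 0:  # a new run starts here
                start = i
            length += 1
        elif length:         # the current run just ended
            runs.append((start, length))
            length = 0
    if length:
        runs.append((start, length))
    return runs


s = "EEEFGAFFGAFFGAFCD"
for start, length in equal_runs(s, 4):
    # A run of `length` equal characters at shift w means the block
    # s[start:start+w] occurs 1 + length // w times in a row.
    print(start, length, 1 + length // 4)
# prints: 3 8 3
```

Here the run of 8 equal characters at shift 4 is (N-1)*W with N = 3 and W = 4, matching the three adjacent copies of "FGAF".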

EDIT: update() did far too much futile work in many cases; replaced it.

def crunch2(s):
    from collections import deque

    # There are zcount equal characters starting
    # at index starti.
    def update(starti, zcount):
        nonlocal bestlen
        while zcount >= width:
            numreps = 1 + zcount // width
            count = width * numreps
            if count >= bestlen:
                if count > bestlen:
                    results.clear()
                results.append((starti, width, numreps))
                bestlen = count
            else:
                break
            zcount -= 1
            starti += 1

    bestlen, results = 0, []
    t = deque(s)
    for width in range(1, len(s) // 2 + 1):
        t.popleft()
        zcount = 0
        for i, (a, b) in enumerate(zip(s, t)):
            if a == b:
                if not zcount: # new run starts here
                    starti = i
                zcount += 1
            # else a != b, so equal run (if any) ended
            elif zcount:
                update(starti, zcount)
                zcount = 0
        if zcount:
            update(starti, zcount)
    return bestlen, results
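Using regexps

[removed this due to size limit]

Using suffix arrays

This is the fastest I've found so far, although it can still be provoked into quadratic-time behavior.

Note that it doesn't matter whether overlapping strings are found. As explained for the crunch2() program above (here elaborated on in minor ways):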

  • Given string s with length n = len(s).
  • Given ints i and j with 0 <= i < j < n.
Then if w = j - i, and c is the number of leading characters in common between s[i:] and s[j:], the block s[i:j] (of length w) is repeated, starting at s[i], a total of 1 + c // w times.
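The suffix-array programs here import `suffix_array` from a module named `sa` that is not shown. The following is a naive stand-in, not the author's module. It assumes the conventions the code appears to rely on: `sa` lists suffix start indices in sorted order, `rank` is the inverse permutation of `sa`, and `lcp[i]` is the length of the common prefix of the suffixes at `sa[i-1]` and `sa[i]`, with `lcp[0] == 0`. Sorting whole suffixes is far too slow for large inputs, so this is suitable only for experimenting on short strings.

```python
def suffix_array(text):
    """Naive (sa, rank, lcp) construction for short inputs."""
    n = len(text)
    # Sort suffix start positions by the suffix they begin.
    sa = sorted(range(n), key=lambda i: text[i:])
    # rank[i] is the position of suffix i within sa.
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    # lcp[r] compares the suffix at sa[r] with its predecessor.
    lcp = [0] * n
    for r in range(1, n):
        a, b = sa[r - 1], sa[r]
        k = 0
        while a + k < n and b + k < n and text[a + k] == text[b + k]:
            k += 1
        lcp[r] = k
    return sa, rank, lcp


sa, rank, lcp = suffix_array("bcdabcdbcd")
print(sa)   # [3, 7, 0, 4, 8, 1, 5, 9, 2, 6]
print(lcp)  # [0, 0, 3, 3, 0, 2, 2, 0, 1, 1]
```

With i = 4 and j = 7 in this string, w = 3 and the common prefix "bcd" gives c = 3, so the block "bcd" repeats 1 + 3 // 3 = 2 times starting at index 4.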
    def crunch4(s):
        from sa import suffix_array
        sa, rank, lcp = suffix_array(s)
        bestlen, results = 0, []
        n = len(s)
        for sai in range(n-1):
            i = sa[sai]
            c = n
            for saj in range(sai + 1, n):
                c = min(c, lcp[saj])
                if not c:
                    break
                j = sa[saj]
                w = abs(i - j)
                if c < w:
                    continue
                numreps = 1 + c // w
                assert numreps > 1
                total = w * numreps
                if total >= bestlen:
                    if total > bestlen:
                        results.clear()
                        bestlen = total
                    results.append((min(i, j), w, numreps))
        return bestlen, results
    
    >>> len(xs)
    209755
    >>> xs.count('\n')
    25481
    
    >>> crunch2(xs)
    (44, [(63750, 22, 2)])
    >>> xs[63750 : 63750+50]
    '\nelectroencephalograph\nelectroencephalography\nelec'
    
    >>> crunch3(xs)
    (8, [(19308, 4, 2), (47240, 4, 2)])
    >>> xs[19308 : 19308+10]
    'beriberi\nB'
    >>> xs[47240 : 47240+10]
    'couscous\nc'
    
    >>> crunch3(xs) # with DOTALL
    (44, [(63750, 22, 2)])
    
    >>> crunch4(xs)
    (44, [(63750, 22, 2)])
    
    'x' * 1000000
    
    def crunch5(text):
        from collections import namedtuple, defaultdict
    
        # For all integers i and j in IxSet x.s,
        # text[i : i + x.w] == text[j : j + x.w].
        # That is, it's the set of all indices at which a specific
        # substring of length x.w is found.
        # In general, we only care about repeated substrings here,
        # so weed out those that would otherwise have len(x.s) == 1.
        IxSet = namedtuple("IxSet", "s w")
    
        bestlen, results = 0, []
    
        # Compute sets of indices for repeated (not necessarily
        # adjacent!) substrings of length xs[0].w + ys[0].w, by looking
        # at the cross product of the index sets in xs and ys.
        def combine(xs, ys):
            xw, yw = xs[0].w, ys[0].w
            neww = xw + yw
            result = []
            for y in ys:
                shifted = set(i - xw for i in y.s if i >= xw)
                for x in xs:
                    ok = shifted & x.s
                    if len(ok) > 1:
                        result.append(IxSet(ok, neww))
            return result
    
        # Check an index set for _adjacent_ repeated substrings.
        def check(s):
            nonlocal bestlen
            x, w = s.s.copy(), s.w
            while x:
                current = start = x.pop()
                count = 1
                while current + w in x:
                    count += 1
                    current += w
                    x.remove(current)
                while start - w in x:
                    count += 1
                    start -= w
                    x.remove(start)
                if count > 1:
                    total = count * w
                    if total >= bestlen:
                        if total > bestlen:
                            results.clear()
                            bestlen = total
                        results.append((start, w, count))
    
        ch2ixs = defaultdict(set)
        for i, ch in enumerate(text):
            ch2ixs[ch].add(i)
        size1 = [IxSet(s, 1)
                 for s in ch2ixs.values()
                 if len(s) > 1]
        del ch2ixs
        for x in size1:
            check(x)
    
        current_class = size1
        # Repeatedly increase size by 1 until current_class becomes
        # empty. At that point, there are no repeated substrings at all
        # (adjacent or not) of the then-current size (or larger).
        while current_class:
            current_class = combine(current_class, size1)
            for x in current_class:
                check(x)
        
        return bestlen, results
    
    def crunch6(text):
        from sa import suffix_array
        sa, rank, lcp = suffix_array(text)
        bestlen, results = 0, []
        n = len(text)
    
        # Generate maximal sets of indices s such that for all i and j
        # in s the suffixes starting at s[i] and s[j] start with a
        # common prefix of at least len minc.
        def genixs(minc, sa=sa, lcp=lcp, n=n):
            i = 1
            while i < n:
                c = lcp[i]
                if c < minc:
                    i += 1
                    continue
                ixs = {sa[i-1], sa[i]}
                i += 1
                while i < n:
                    c = min(c, lcp[i])
                    if c < minc:
                        yield ixs
                        i += 1
                        break
                    else:
                        ixs.add(sa[i])
                        i += 1
                else: # ran off the end of lcp
                    yield ixs
    
        # Check an index set for _adjacent_ repeated substrings
        # w apart.  CAUTION: this empties s.
        def check(s, w):
            nonlocal bestlen
            while s:
                current = start = s.pop()
                count = 1
                while current + w in s:
                    count += 1
                    current += w
                    s.remove(current)
                while start - w in s:
                    count += 1
                    start -= w
                    s.remove(start)
                if count > 1:
                    total = count * w
                    if total >= bestlen:
                        if total > bestlen:
                            results.clear()
                            bestlen = total
                        results.append((start, w, count))
    
        c = 0
        found = True
        while found:
            c += 1
            found = False
            for s in genixs(c):
                found = True
                check(s, c)
        return bestlen, results
    
    >>> x = "bcdabcdbcd"
    >>> crunch4(x)  # finds repeated bcd at end
    (6, [(4, 3, 2)])
    >>> crunch4a(x) # finds nothing
    (0, [])
    
    bcd
    bcdabcdbcd
    bcdbcd
    
    # only look at adjacent entries - fast, but sometimes wrong
    def crunch4a(s):
        from sa import suffix_array
        sa, rank, lcp = suffix_array(s)
        bestlen, results = 0, []
        n = len(s)
        for sai in range(1, n):
            i, j = sa[sai - 1], sa[sai]
            c = lcp[sai]
            w = abs(i - j)
            if c >= w:
                numreps = 1 + c // w
                total = w * numreps
                if total >= bestlen:
                    if total > bestlen:
                        results.clear()
                        bestlen = total
                    results.append((min(i, j), w, numreps))
        return bestlen, results
    
    # Generate lcp intervals from the lcp array.
    def genlcpi(lcp):
        lcp.append(0)
        stack = [(0, 0)]
        for i in range(1, len(lcp)):
            c = lcp[i]
            lb = i - 1
            while c < stack[-1][0]:
                i_c, lb = stack.pop()
                interval = i_c, lb, i - 1
                yield interval
            if c > stack[-1][0]:
                stack.append((c, lb))
        lcp.pop()
    
    def crunch9(text):
        from sa import suffix_array
    
        sa, rank, lcp = suffix_array(text)
        bestlen, results = 0, []
        n = len(text)
    
        # generate branching tandem repeats
        def gen_btr(text=text, n=n, sa=sa):
            for c, lb, rb in genlcpi(lcp):
                i = sa[lb]
                basic = text[i : i + c]
                # Binary searches to find subrange beginning with
                # basic+basic. A more gonzo implementation would do this
                # character by character, never materializing the common
                # prefix in `basic`.
                rb += 1
                hi = rb
                while lb < hi:  # like bisect.bisect_left
                    mid = (lb + hi) // 2
                    i = sa[mid] + c
                    if text[i : i + c] < basic:
                        lb = mid + 1
                    else:
                        hi = mid
                lo = lb
                while lo < rb:  # like bisect.bisect_right
                    mid = (lo + rb) // 2
                    i = sa[mid] + c
                    if basic < text[i : i + c]:
                        rb = mid
                    else:
                        lo = mid + 1
                lead = basic[0]
                for sai in range(lb, rb):
                    i = sa[sai]
                    j = i + 2*c
                    assert j <= n
                    if j < n and text[j] == lead:
                        continue # it's non-branching
                    yield (i, c, 2)
    
        for start, c, _ in gen_btr():
            # extend left
            numreps = 2
            for i in range(start - c, -1, -c):
                if all(text[i+k] == text[start+k] for k in range(c)):
                    start = i
                    numreps += 1
                else:
                    break
            totallen = c * numreps
            if totallen < bestlen:
                continue
            if totallen > bestlen:
                bestlen = totallen
                results.clear()
            results.append((start, c, numreps))
            # add non-branches
            while start:
                if text[start - 1] == text[start + c - 1]:
                    start -= 1
                    results.append((start, c, numreps))
                else:
                    break
        return bestlen, results
    
    assert all(rank[sa[i]] == sa[rank[i]] == i for i in range(len(sa)))
    
    # Generate lcp intervals from the lcp array.
    def genlcpi(lcp):
        lcp.append(0)
        stack = [(0, 0)]
        for i in range(1, len(lcp)):
            c = lcp[i]
            lb = i - 1
            while c < stack[-1][0]:
                i_c, lb = stack.pop()
                yield (i_c, lb, i)
            if c > stack[-1][0]:
                stack.append((c, lb))
        lcp.pop()
    
    def crunch11(text):
        from sa import suffix_array
    
        sa, rank, lcp = suffix_array(text)
        bestlen, results = 0, []
        n = len(text)
    
        # Generate branching tandem repeats.
        # (i, c, 2) is branching tandem iff
        #     i+c in interval with prefix text[i : i+c], and
        #     i+c not in subinterval with prefix text[i : i+c + 1]
        # Caution: this pragmatically relies on that, in Python 3,
        # `range()` returns a tiny object with O(1) membership testing.
        # In Python 2 it returns a list - should still work, but very
        # much slower.
        def gen_btr(text=text, n=n, sa=sa, rank=rank):
            from itertools import chain
    
            for c, lb, rb in genlcpi(lcp):
                origlb, origrb = lb, rb
                origrange = range(lb, rb)
                i = sa[lb]
                lead = text[i]
                # Binary searches to find subrange beginning with
                # text[i : i+c+1]. Note we take slices of length 1
                # rather than just index to avoid special-casing for
                # i >= n.
                # A more elaborate traversal of the lcp array could also
                # give us a list of child intervals, and then we'd just
                # need to pick the right one. But that would be even
                # more hairy code, and unclear to me it would actually
                # help the worst cases (yes, the interval can be large,
                # but so can a list of child intervals).
                hi = rb
                while lb < hi:  # like bisect.bisect_left
                    mid = (lb + hi) // 2
                    i = sa[mid] + c
                    if text[i : i+1] < lead:
                        lb = mid + 1
                    else:
                        hi = mid
                lo = lb
                while lo < rb:  # like bisect.bisect_right
                    mid = (lo + rb) // 2
                    i = sa[mid] + c
                    if lead < text[i : i+1]:
                        rb = mid
                    else:
                        lo = mid + 1
                subrange = range(lb, rb)
                if 2 * len(subrange) <= len(origrange):
                    # Subrange is at most half the size.
                    # Iterate over it to find candidates i, starting
                    # with wa.  If i+c is also in origrange, but not
                    # in subrange, good:  then i is of the form wwx.
                    for sai in subrange:
                        i = sa[sai]
                        ic = i + c
                        if ic < n:
                            r = rank[ic]
                            if r in origrange and r not in subrange:
                                yield (i, c, 2, subrange)
                else:
                    # Iterate over the parts outside subrange instead.
                    # Candidates i are then the trailing wx in the
                    # hoped-for wwx. We win if i-c is in subrange too
                    # (or, for that matter, if it's in origrange).
                    for sai in chain(range(origlb, lb),
                                     range(rb, origrb)):
                        ic = sa[sai] - c
                        if ic >= 0 and rank[ic] in subrange:
                            yield (ic, c, 2, subrange)
    
        for start, c, numreps, irange in gen_btr():
            # extend left
            crange = range(start - c, -1, -c)
            if (numreps + len(crange)) * c < bestlen:
                continue
            for i in crange:
                if rank[i] in irange:
                    start = i
                    numreps += 1
                else:
                    break
            # check for best
            totallen = c * numreps
            if totallen < bestlen:
                continue
            if totallen > bestlen:
                bestlen = totallen
                results.clear()
            results.append((start, c, numreps))
            # add non-branches
            while start and text[start - 1] == text[start + c - 1]:
                start -= 1
                results.append((start, c, numreps))
        return bestlen, results