Python: algorithm to find the most repeated (not the most common) sequence in a string (a.k.a. tandem repeats)


python regex python-3.x string algorithm


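I am looking for an algorithm (possibly implemented in Python) that can find the most repeated sequence in a string. By "repeated" I mean any combination of characters that is repeated over and over without interruption (a tandem repeat).

The algorithm I am looking for is not the same as the "find the most common word" one. In fact, the repeated block does not need to be the most common word (substring) in the string.

For example: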

s = 'asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs'
> f(s)
'UBAUBAUBAUBAUBA' #the "most common word" algo would return 'BA'

Unfortunately, I have no idea how to approach this. Any help is very welcome.

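Update

Here is an additional example to clarify that I want the sequence with the highest number of repetitions returned, whatever its basic building block is: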

g = 'some noisy spacer'
s = g + 'AB'*5 + g + '_ABCDEF'*2 + g + 'AB'*3
> f(s)
'ABABABABAB' #the one with the most repetitions, not the max len

Examples from @rici:

s = 'aaabcabc'
> f(s)
'abcabc'

s = 'ababcababc'
> f(s)
'ababcababc' #'abab' would also be a solution here
             # since it is repeated 2 times in a row as 'ababcababc'.
             # The proper algorithm would return both solutions.

Use `re.findall()` (with a specific regex pattern) combined with the `max()` function:

import re

#  extended sample string
s = 'asdfewfUBAUBAUBAUBAUBAasdkjnfencsADADADAD sometext'

def find_longest_rep(s):
    result = max(re.findall(r'((\w+?)\2+)', s), key=lambda t: len(t[0]))
    return result[0]

print(find_longest_rep(s))

Output:

UBAUBAUBAUBAUBA

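The key pattern:

```
((\w+?)\2+)
```

- `(...)` is the outermost capturing group, i.e. the first capturing group.
- `(\w+?)` is the second capturing group and matches any sequence of word characters (letters, digits and underscore). `+?` is a lazy quantifier: it matches between one and unlimited times, as few times as possible, expanding as needed.
- `\2+` matches one or more further copies of the text most recently matched by the second capturing group.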
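To make the two capturing groups concrete, here is a small sketch of what `re.findall()` returns for this pattern. The sample string and the names `PATTERN`, `hits`, `longest` and `most_repeated` are made up for illustration. Each hit is a `(whole repeat, repeated unit)` tuple, so the number of repetitions can be derived as `len(whole) // len(unit)`:

```python
import re

# Same pattern as above: group 1 is the whole tandem repeat, group 2 is its unit.
PATTERN = re.compile(r'((\w+?)\2+)')

# Illustrative sample string (not taken from the question).
s = 'asdfUBAUBAUBAzzADADADAD text'

hits = PATTERN.findall(s)
print(hits)

# Ranking by total length picks the longest repeat ...
longest = max(hits, key=lambda t: len(t[0]))[0]
# ... while ranking by len(whole) // len(unit) picks the one with most repetitions.
most_repeated = max(hits, key=lambda t: len(t[0]) // len(t[1]))[0]

print(longest)
print(most_repeated)
```

On this input the two rankings disagree: the longest hit is 'UBAUBAUBA' (3 repetitions of a 3-character unit), while the hit with the most repetitions is 'ADADADAD' (4 repetitions of a 2-character unit).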

Here is a solution based on the `((\w+?)\2+)` regex, with some additional improvements:

import re
from itertools import chain


def repetitive(sequence, rep_min_len=1):
    """Find the most repetitive sequence in a string.

    :param str sequence: string for search
    :param int rep_min_len: minimal length of repetitive substring
    :return the most repetitive substring or None
    """
    greedy, non_greedy = re.compile(r'((\w+)\2+)'), re.compile(r'((\w+?)\2+)')

    all_rep_seach = lambda regex: \
        (regex.search(sequence[shift:]) for shift in range(len(sequence)))

    searched = list(
        res.groups()
        for res in chain(all_rep_seach(greedy), all_rep_seach(non_greedy))
        if res)

    if not searched:  # no tandem repeat found (also covers an empty input)
        return None

    cmp_key = lambda res: res[0].count(res[1]) if len(res[1]) >= rep_min_len else 0
    return max(searched, key=cmp_key)[0]

You can test it like this:

def check(seq, expected, rep_min_len=1):
    result = repetitive(seq, rep_min_len)
    print('%s => %s' % (seq, result))
    assert result == expected, expected


check('asdfewfUBAUBAUBAUBAUBAasdkBAjnfBAenBAcs', 'UBAUBAUBAUBAUBA')
check('some noisy spacerABABABABABsome noisy spacer_ABCDEF_ABCDEFsome noisy spacerABABAB', 'ABABABABAB')
check('aaabcabc', 'aaa')
check('aaabcabc', 'abcabc', rep_min_len=2)
check('ababcababc', 'ababcababc')
check('ababcababcababc', 'ababcababcababc')

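Main features:

- uses both the greedy `((\w+)\2+)` and the non-greedy `((\w+?)\2+)` regex;
- searches for repeated substrings in every suffix of the string, shifting the start one character at a time (e.g. 'string' => ['string', 'tring', 'ring', 'ing', 'ng', 'g']);
- selection is based on the number of repetitions, not on the length of the subsequence (e.g. the result for 'ABABABAB_ABCDEF_ABCDEF' is 'ABABABAB', not '_ABCDEF_ABCDEF');
- the minimal length of the repeating unit matters (see the 'aaabcabc' checks).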
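The suffix shifting matters because `re.findall()` only reports non-overlapping matches, so a repeat that overlaps an earlier match is never seen. The following sketch compares both approaches on 'aaabcabc'; the helper name `repeats_by_suffix` is hypothetical and only the lazy pattern is used to keep it short:

```python
import re

LAZY = re.compile(r'((\w+?)\2+)')

def repeats_by_suffix(s):
    """Collect the first lazy match found in every suffix s[shift:]."""
    found = set()
    for shift in range(len(s)):
        m = LAZY.search(s[shift:])
        if m:
            found.add(m.groups())  # (whole repeat, repeated unit)
    return found

s = 'aaabcabc'
plain = LAZY.findall(s)        # non-overlapping matches only
shifted = repeats_by_suffix(s)

print(plain)
print(sorted(shifted))
```

Here `findall()` consumes 'aaa' first and then cannot find 'abcabc', because that repeat starts inside the text already matched. Searching from the suffix that starts at index 2 does find it.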

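What you are searching for is an algorithm that finds the "largest" primitive tandem repeat in a string. There is a paper describing a linear-time algorithm for finding all tandem repeats in a string and, by extension, all primitive tandem repeats.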
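As a rough illustration of the term, assuming the usual definition that a tandem repeat is primitive when its repeated unit is not itself a repetition of a shorter string, the check below uses the well-known "search for s inside s + s" trick. The function names `is_primitive` and `primitive_root` are made up for this sketch, and it is not the linear-time algorithm mentioned above:

```python
def is_primitive(unit):
    """True if `unit` is non-empty and not a repetition of a shorter string."""
    # `unit` reappears in unit + unit at an offset smaller than len(unit)
    # exactly when it is built from a shorter repeated block.
    return len(unit) > 0 and (unit + unit).find(unit, 1) == len(unit)

def primitive_root(unit):
    """Return the shortest block whose repetition gives `unit`."""
    if not unit:
        return unit
    period = (unit + unit).find(unit, 1)
    return unit[:period]

print(is_primitive('abc'))     # 'abc' is not built from a shorter block
print(is_primitive('abab'))    # 'abab' is 'ab' repeated twice
print(primitive_root('abab'))
```

With this distinction, 'abababab' read as 'abab' repeated twice is not a primitive tandem repeat, while the same text read as 'ab' repeated four times is.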

Here is a brute-force algorithm I wrote. Maybe it will be useful:

def find_most_repetitive_substring(string):
    max_counter = 1
    position, substring_length, times = 0, 0, 0
    for i in range(len(string)):
        for j in range(len(string) - i):
            counter = 1
            if j == 0:
                continue
            while True:
                if string[i + counter * j: i + (counter + 1) * j] != string[i: i + j] or i + (counter + 1) * j > len(string):
                    if counter > max_counter:
                        max_counter = counter
                        position, substring_length, times = i, j, counter
                    break
                else:
                    counter += 1
    return string[position: position + substring_length * times]
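The question notes that a proper algorithm would return every solution when several repeats tie on the number of repetitions. Below is a sketch of a brute-force variant that does this. The function name `most_repetitive_all` and its return shape are assumptions made for this example, and it is not the code from the answer above:

```python
def most_repetitive_all(s):
    """Return (max repetition count, sorted list of all repeats reaching it)."""
    best_count = 1
    best = set()
    n = len(s)
    for start in range(n):
        # A unit must fit at least twice into the rest of the string.
        for unit_len in range(1, (n - start) // 2 + 1):
            unit = s[start:start + unit_len]
            count = 1
            pos = start + unit_len
            # Extend the repeat while the next block equals the unit.
            while s.startswith(unit, pos):
                count += 1
                pos += unit_len
            if count > best_count:
                best_count = count
                best = {s[start:pos]}
            elif count == best_count and count > 1:
                best.add(s[start:pos])
    return best_count, sorted(best)

print(most_repetitive_all('ababcababc'))
print(most_repetitive_all('aaabcabc'))
```

For 'ababcababc' both 'abab' and 'ababcababc' are reported, since each is a unit repeated twice. The running time is roughly cubic in the worst case, so this is only suitable for short strings.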

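Comments:

By "repeated", do you mean a substring that appears at least twice? Have a look at suffix trees and Ukkonen's algorithm.

Possible duplicate.

Most repeated or longest repeated?

@Oliver Longest repetition.

Can you share the expected time complexity of the program?

Nice solution! Thanks. I will test it with a few examples before accepting the answer.

Here are two strings for which it fails: longest_rep('aaabcabc') => 'aaa' (should be 'abcabc'); longest_rep('ababcababc') => 'abab' (should be 'ababcababc').

@rici You are right, I have added a more meaningful example to my post. Still, this solution is very useful: `re.findall(r'((\w+?)\2+)', s)` finds all the repeated blocks, and picking the one with the most repetitions is then a piece of cake. If nothing better comes up today, I think this answer should be accepted, since it gave me a very useful hint on how to proceed with the algorithm I am looking for.

Do you know where I can find an implementation of it?

Hi, I am trying your solution, but for any input the result is always None.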