Efficient way to find the longest duplicated string in Python (from Programming Pearls)

From Section 15.2 of Programming Pearls.

The C code can be viewed here:

When I implemented it in Python using a suffix array:

example = open("iliad10.txt").read()
def comlen(p, q):  # length of the common prefix of p and q
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  #VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
I found the idx.sort step to be very slow. I suspect it is slow because Python has to pass the substrings by value rather than by pointer (as the C code above does).
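For illustration (a minimal check, not part of the original post): CPython slicing really does copy the bytes rather than reference them, so every comparison in the sort above pays for a fresh suffix string:

import sys

s = 'x' * 10**6
suffix = s[1:]               # builds a brand-new string object
print sys.getsizeof(suffix)  # ~1 MB: the suffix was copied, not referenced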

The test file can be downloaded from here:

The C code finishes in only 0.3 seconds:

time cat iliad10.txt |./longdup 
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away. 

real    0m0.328s
user    0m0.291s
sys 0m0.006s
But the Python code never finishes on my computer (I waited 10 minutes and then killed it).


Does anyone know how to make the code efficient, e.g. finish in under 10 seconds?

The main problem seems to be that Python does slicing by copy:

You will have to use something that gives you a reference instead of a copy (a memoryview, for example). When I do this, the program hangs right after the idx.sort call (which itself becomes very fast).

I'm confident that with a little work you can get the rest working.
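As a minimal sketch of the reference-instead-of-copy idea (using Python 2's buffer(), the route a later answer takes, rather than this answer's memoryview attempt):

# Hypothetical replacement for the slow sort line in the question:
# buffer(example, a) is a zero-copy view of example[a:] that compares
# lexicographically by content, so no suffix strings are materialized.
idx.sort(key=lambda a: buffer(example, a))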

EDIT:

The above change will not work as a drop-in replacement, because cmp does not work the same way as strcmp. For example, try the following C code:

#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
This C code prints -3 on my machine, while the Python version prints -1. It looks like the example C code is abusing the return value of strcmp (it is used in qsort after all, which only cares about the sign of the comparator's result). I couldn't find any documentation on when strcmp would return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed lots of values outside that range (3, -31, 5 were the first three values).

To make sure that -3 wasn't an error code: if we reverse test1 and test2, we get 3.

EDIT:

The above is some interesting trivia, but it is not actually correct as far as its effect on the code goes. I realized this just as I closed my laptop and left the wifi zone... I really should double-check everything before hitting Save.

FWIW, cmp certainly does work on memoryview objects (it prints -1 as expected):
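(The snippet itself was lost in the scrape; presumably it was something along these lines, reusing the strings from the C test above — a hypothetical reconstruction:)

test1 = "ovided by The Internet Classics Archive"
test2 = "rovided by The Internet Classics Archive."
print cmp(memoryview(test1), memoryview(test2))  # the answer reports -1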


I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.

This version takes about 17 seconds on my circa-2007 desktop, using a totally different algorithm:

#!/usr/bin/env python

ex = open("iliad.mb.txt").read()

chains = dict()

# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
    s = ''.join(b)
    if s not in chains :
        chains[s] = list()

    chains[s].append(a)

def grow_chains(chains) :
    new_chains = dict()
    for (string,pos) in chains :
        offset = len(string)
        for p in pos :
            if p + offset >= len(ex) : break

            # add one more character
            s = string + ex[p + offset]

            if s not in new_chains :
                new_chains[s] = list()

            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1 :
    print 'length of chains', len(chains)

    # remove chains that appear only once
    chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]

    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]

    chains = grow_chains(chains)

The basic idea is to maintain a list of substrings together with their positions, which removes the need to compare the same strings over and over again. The resulting list looks like [('ind him, but', [466548, 739011]), ('bull wark bot', [428251, 428924]), ...]. Unique strings are removed. Then every list member grows by one character and a new list is created. Unique strings are removed again. And so on...

Translating the algorithm to Python:

from itertools import imap, izip, starmap, tee
from os.path import commonprefix

def pairwise(iterable): # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
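A quick sanity check (not from the original answer):

print longest_duplicate_small('banana')  # -> 'ana'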
buffer() allows getting a substring without copying:

def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
    def lcp_item(i, j): # find longest common prefix array item
        start = i
        # NOTE: the loop body below reconstructs code that was truncated in
        # the source: advance through both suffixes while characters match.
        while i < n and j < n and data[i] == data[j]:
            i += 1
            j += 1
        return i - start, start
    size, start = max(lcp_item(i, j) for i, j in pairwise(sa))
    return data[start:start + size]
It takes 5 seconds on my machine for iliad.mb.txt.
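The buffer-based sort key works because Python 2 buffer objects compare lexicographically by content, e.g.:

print buffer('banana', 3) < buffer('banana', 1)  # 'ana' < 'anana' -> True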

In principle it is possible to find the duplicate in O(n) time and O(n) memory, using a suffix array augmented with an LCP (longest common prefix) array.


Note: the *_memoryview() versions have been deprecated in favour of the *_buffer() versions.

A more memory-efficient version (compared to longest_duplicate_small()):

def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b
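The driver that ties these helpers together did not survive the scrape; a plausible sketch (hypothetical reconstruction following the naming above):

def longest_duplicate(data):
    mv = memoryview(data)
    # sort zero-copy memoryview suffixes using the explicit comparator above
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()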
It takes 17 seconds on my machine for iliad.mb.txt. The result is:

On this the rest of the Achaeans with one voice were for respecting the priest and taking the ransom that he offered; but not so Agamemnon, who spoke fiercely to him and sent him roughly away.

Related questions:

My solution is based on suffix arrays, constructed by prefix doubling of the longest common prefix. The worst-case complexity is O(n (log n)^2). The task iliad.mb.txt takes 4 seconds on my laptop. The code is well documented inside the functions suffix_array and longest_common_substring. The latter function is short and can easily be modified, e.g. for searching the 10 longest non-overlapping repeated substrings. This Python code is faster than the C code from the question if duplicated strings are longer than 10000 characters.

from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for any criteria, e.g. for searching
    ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if not substring in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of the length _step are first pre-sorted. Then the
    results are repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value:      (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between  tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip already resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
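Hypothetical usage on the thread's test file (print() used for Python 2/3 compatibility):

with open('iliad.mb.txt') as f:
    text = f.read()
for substring, positions in longest_common_substring(text).items():
    print(len(substring), positions)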
The longest repeated string it finds in iliad.mb.txt is the same passage:

On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.