Efficient way to find the longest duplicated string in Python (from Programming Pearls)

From Section 15.2 of Programming Pearls.

The C code can be viewed here:

When I implemented it in Python using a suffix array:

example = open("iliad10.txt").read()
def comlen(p, q):  # length of the common prefix of p and q
    i = 0
    for x in zip(p, q):
        if x[0] == x[1]:
            i += 1
        else:
            break
    return i

suffix_list = []
example_len = len(example)
idx = list(range(example_len))
idx.sort(cmp = lambda a, b: cmp(example[a:], example[b:]))  #VERY VERY SLOW

max_len = -1
for i in range(example_len - 1):
    this_len = comlen(example[idx[i]:], example[idx[i+1]:])
    print this_len
    if this_len > max_len:
        max_len = this_len
        maxi = i
I found the idx.sort step to be very slow. I suspect it is slow because Python has to pass the substrings by value rather than by pointer (as the C code above does).
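For illustration (a minimal check, not part of the original post): CPython slicing really does copy the bytes rather than reference them, so every comparison in the sort above pays for a fresh suffix string:

import sys

s = 'x' * 10**6
suffix = s[1:]               # builds a brand-new string object
print sys.getsizeof(suffix)  # ~1 MB: the suffix was copied, not referenced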

The test file can be downloaded from here:

The C code finishes in only 0.3 seconds:

time cat iliad10.txt |./longdup 
On this the rest of the Achaeans with one voice were for
respecting the priest and taking the ransom that he offered; but
not so Agamemnon, who spoke fiercely to him and sent him roughly
away. 

real    0m0.328s
user    0m0.291s
sys 0m0.006s
But the Python code never finishes on my computer (I waited 10 minutes and then killed it).


Does anyone know how to make the code efficient, e.g. finish in under 10 seconds?

The main problem seems to be that Python does slicing by copy:

You will have to use something that gives you a reference instead of a copy (a memoryview, for example). When I do this, the program hangs right after the idx.sort call (which itself becomes very fast).

I'm confident that with a little work you can get the rest working.
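As a minimal sketch of the reference-instead-of-copy idea (using Python 2's buffer(), the route a later answer takes, rather than this answer's memoryview attempt):

# Hypothetical replacement for the slow sort line in the question:
# buffer(example, a) is a zero-copy view of example[a:] that compares
# lexicographically by content, so no suffix strings are materialized.
idx.sort(key=lambda a: buffer(example, a))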

EDIT:

The above change will not work as a drop-in replacement, because cmp does not work the same way as strcmp. For example, try the following C code:

#include <stdio.h>
#include <string.h>

int main() {
    char* test1 = "ovided by The Internet Classics Archive";
    char* test2 = "rovided by The Internet Classics Archive.";
    printf("%d\n", strcmp(test1, test2));
}
This C code prints -3 on my machine, while the Python version prints -1. It looks like the example C code is abusing the return value of strcmp (it is used in qsort after all, which only cares about the sign of the comparator's result). I couldn't find any documentation on when strcmp would return something other than [-1, 0, 1], but adding a printf to pstrcmp in the original code showed lots of values outside that range (3, -31, 5 were the first three values).

To make sure that -3 wasn't an error code: if we reverse test1 and test2, we get 3.

EDIT:

The above is some interesting trivia, but it is not actually correct as far as its effect on the code goes. I realized this just as I closed my laptop and left the wifi zone... I really should double-check everything before hitting Save.

FWIW, cmp certainly does work on memoryview objects (it prints -1 as expected):
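(The snippet itself was lost in the scrape; presumably it was something along these lines, reusing the strings from the C test above — a hypothetical reconstruction:)

test1 = "ovided by The Internet Classics Archive"
test2 = "rovided by The Internet Classics Archive."
print cmp(memoryview(test1), memoryview(test2))  # the answer reports -1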


I'm not sure why the code isn't working as expected. Printing out the list on my machine does not look as expected. I'll look into this and try to find a better solution instead of grasping at straws.

This version takes about 17 seconds on my circa-2007 desktop, using a totally different algorithm:

#!/usr/bin/env python

ex = open("iliad.mb.txt").read()

chains = dict()

# populate initial chains dictionary
for (a,b) in enumerate(zip(ex,ex[1:])) :
    s = ''.join(b)
    if s not in chains :
        chains[s] = list()

    chains[s].append(a)

def grow_chains(chains) :
    new_chains = dict()
    for (string,pos) in chains :
        offset = len(string)
        for p in pos :
            if p + offset >= len(ex) : break

            # add one more character
            s = string + ex[p + offset]

            if s not in new_chains :
                new_chains[s] = list()

            new_chains[s].append(p)
    return new_chains

# grow and filter, grow and filter
while len(chains) > 1 :
    print 'length of chains', len(chains)

    # remove chains that appear only once
    chains = [(i,chains[i]) for i in chains if len(chains[i]) > 1]

    print 'non-unique chains', len(chains)
    print [i[0] for i in chains[:3]]

    chains = grow_chains(chains)

The basic idea is to maintain a list of substrings together with their positions, which removes the need to compare the same strings over and over again. The resulting list looks like [('ind him, but', [466548, 739011]), ('bull wark bot', [428251, 428924]), ...]. Unique strings are removed. Then every list member grows by one character and a new list is created. Unique strings are removed again. And so on...

Translating the algorithm to Python:

from itertools import imap, izip, starmap, tee
from os.path import commonprefix

def pairwise(iterable): # itertools recipe
    a, b = tee(iterable)
    next(b, None)
    return izip(a, b)

def longest_duplicate_small(data):
    suffixes = sorted(data[i:] for i in xrange(len(data))) # O(n*n) in memory
    return max(imap(commonprefix, pairwise(suffixes)), key=len)
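A quick sanity check (not from the original answer):

print longest_duplicate_small('banana')  # -> 'ana'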
buffer() allows getting a substring without copying:

def longest_duplicate_buffer(data):
    n = len(data)
    sa = sorted(xrange(n), key=lambda i: buffer(data, i)) # suffix array
    def lcp_item(i, j): # find longest common prefix array item
        start = i
        # NOTE: the loop body below reconstructs code that was truncated in
        # the source: advance through both suffixes while characters match.
        while i < n and j < n and data[i] == data[j]:
            i += 1
            j += 1
        return i - start, start
    size, start = max(lcp_item(i, j) for i, j in pairwise(sa))
    return data[start:start + size]
It takes 5 seconds on my machine for iliad.mb.txt.
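The buffer-based sort key works because Python 2 buffer objects compare lexicographically by content, e.g.:

print buffer('banana', 3) < buffer('banana', 1)  # 'ana' < 'anana' -> True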

In principle it is possible to find the duplicate in O(n) time and O(n) memory, using a suffix array augmented with an LCP (longest common prefix) array.


Note: the *_memoryview() versions have been deprecated in favour of the *_buffer() versions.

A more memory-efficient version (compared to longest_duplicate_small()):

def cmp_memoryview(a, b):
    for x, y in izip(a, b):
        if x < y:
            return -1
        elif x > y:
            return 1
    return cmp(len(a), len(b))

def common_prefix_memoryview((a, b)):
    for i, (x, y) in enumerate(izip(a, b)):
        if x != y:
            return a[:i]
    return a if len(a) < len(b) else b
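The driver that ties these helpers together did not survive the scrape; a plausible sketch (hypothetical reconstruction following the naming above):

def longest_duplicate(data):
    mv = memoryview(data)
    # sort zero-copy memoryview suffixes using the explicit comparator above
    suffixes = sorted((mv[i:] for i in xrange(len(mv))), cmp=cmp_memoryview)
    result = max(imap(common_prefix_memoryview, pairwise(suffixes)), key=len)
    return result.tobytes()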
It takes 17 seconds on my machine for iliad.mb.txt. The result is:

On this the rest of the Achaeans with one voice were for respecting the priest and taking the ransom that he offered; but not so Agamemnon, who spoke fiercely to him and sent him roughly away.

Related questions:

My solution is based on suffix arrays, constructed by prefix doubling of the longest common prefix. The worst-case complexity is O(n (log n)^2). The task iliad.mb.txt takes 4 seconds on my laptop. The code is well documented inside the functions suffix_array and longest_common_substring. The latter function is short and can easily be modified, e.g. for searching the 10 longest non-overlapping repeated substrings. This Python code is faster than the C code from the question if duplicated strings are longer than 10000 characters.

from itertools import groupby
from operator import itemgetter

def longest_common_substring(text):
    """Get the longest common substrings and their positions.
    >>> longest_common_substring('banana')
    {'ana': [1, 3]}
    >>> text = "not so Agamemnon, who spoke fiercely to "
    >>> sorted(longest_common_substring(text).items())
    [(' s', [3, 21]), ('no', [0, 13]), ('o ', [5, 20, 38])]

    This function can be easily modified for any criteria, e.g. for searching
    ten longest non-overlapping repeated substrings.
    """
    sa, rsa, lcp = suffix_array(text)
    maxlen = max(lcp)
    result = {}
    for i in range(1, len(text)):
        if lcp[i] == maxlen:
            j1, j2, h = sa[i - 1], sa[i], lcp[i]
            assert text[j1:j1 + h] == text[j2:j2 + h]
            substring = text[j1:j1 + h]
            if not substring in result:
                result[substring] = [j1]
            result[substring].append(j2)
    return dict((k, sorted(v)) for k, v in result.items())

def suffix_array(text, _step=16):
    """Analyze all common strings in the text.

    Short substrings of the length _step are first pre-sorted. Then the
    results are repeatedly merged so that the guaranteed number of compared
    characters is doubled in every iteration until all substrings are
    sorted exactly.

    Arguments:
        text:  The text to be analyzed.
        _step: Is only for optimization and testing. It is the optimal length
               of substrings used for initial pre-sorting. The bigger value is
               faster if there is enough memory. Memory requirements are
               approximately (estimate for 32 bit Python 3.3):
                   len(text) * (29 + (_size + 20 if _size > 2 else 0)) + 1MB

    Return value:      (tuple)
      (sa, rsa, lcp)
        sa:  Suffix array                  for i in range(1, size):
               assert text[sa[i-1]:] < text[sa[i]:]
        rsa: Reverse suffix array          for i in range(size):
               assert rsa[sa[i]] == i
        lcp: Longest common prefix         for i in range(1, size):
               assert text[sa[i-1]:sa[i-1]+lcp[i]] == text[sa[i]:sa[i]+lcp[i]]
               if sa[i-1] + lcp[i] < len(text):
                   assert text[sa[i-1] + lcp[i]] < text[sa[i] + lcp[i]]
    >>> suffix_array(text='banana')
    ([5, 3, 1, 0, 4, 2], [3, 2, 5, 1, 4, 0], [0, 1, 3, 0, 0, 2])

    Explanation: 'a' < 'ana' < 'anana' < 'banana' < 'na' < 'nana'
    The Longest Common String is 'ana': lcp[2] == 3 == len('ana')
    It is between  tx[sa[1]:] == 'ana' < 'anana' == tx[sa[2]:]
    """
    tx = text
    size = len(tx)
    step = min(max(_step, 1), len(tx))
    sa = list(range(len(tx)))
    sa.sort(key=lambda i: tx[i:i + step])
    grpstart = size * [False] + [True]  # a boolean map for iteration speedup.
    # It helps to skip already resolved values. The last value True is a sentinel.
    rsa = size * [None]
    stgrp, igrp = '', 0
    for i, pos in enumerate(sa):
        st = tx[pos:pos + step]
        if st != stgrp:
            grpstart[igrp] = (igrp < i - 1)
            stgrp = st
            igrp = i
        rsa[pos] = igrp
        sa[i] = pos
    grpstart[igrp] = (igrp < size - 1 or size == 0)
    while grpstart.index(True) < size:
        # assert step <= size
        nextgr = grpstart.index(True)
        while nextgr < size:
            igrp = nextgr
            nextgr = grpstart.index(True, igrp + 1)
            glist = []
            for ig in range(igrp, nextgr):
                pos = sa[ig]
                if rsa[pos] != igrp:
                    break
                newgr = rsa[pos + step] if pos + step < size else -1
                glist.append((newgr, pos))
            glist.sort()
            for ig, g in groupby(glist, key=itemgetter(0)):
                g = [x[1] for x in g]
                sa[igrp:igrp + len(g)] = g
                grpstart[igrp] = (len(g) > 1)
                for pos in g:
                    rsa[pos] = igrp
                igrp += len(g)
        step *= 2
    del grpstart
    # create LCP array
    lcp = size * [None]
    h = 0
    for i in range(size):
        if rsa[i] > 0:
            j = sa[rsa[i] - 1]
            while i != size - h and j != size - h and tx[i + h] == tx[j + h]:
                h += 1
            lcp[rsa[i]] = h
            if h > 0:
                h -= 1
    if size > 0:
        lcp[0] = 0
    return sa, rsa, lcp
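Hypothetical usage on the thread's test file (print() used for Python 2/3 compatibility):

with open('iliad.mb.txt') as f:
    text = f.read()
for substring, positions in longest_common_substring(text).items():
    print(len(substring), positions)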
The longest repeated string it finds in iliad.mb.txt is the same passage:

On this the rest of the Achaeans with one voice were for respecting
the priest and taking the ransom that he offered; but not so Agamemnon,
who spoke fiercely to him and sent him roughly away.