Python: find the most common substring pattern in a file

python string algorithm data-structures

You are given a string such as:

input_string = """
HIYourName=this is not true
HIYourName=Have a good day
HIYourName=nope
HIYourName=Bye!"""

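Find the most common substring in the file. Here the answer is "HIYourName=". Note that the challenging part is that HIYourName= is not itself a "word" in the string, i.e. it is not delimited by whitespace on both sides.

So, to clarify, this is not the most-common-word problem.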

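You can build a suffix tree or a suffix array for the string in linear time. Once the suffix tree is built, a depth-first search counts, also in linear time, how many suffixes pass through each node. That number is the number of occurrences of the substring spelled out by the path to that node, and you store it on the node. After that you only need to search the tree for the maximum number of occurrences (linear time) and return the longest substring that occurs that many times (also linear time).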
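The counting idea above can be sketched with a plain, uncompressed suffix trie. This is only an illustration under stated assumptions: it is not the linear-time suffix tree the answer refers to (it needs O(n^2) time and memory), and the names `build_suffix_trie`, `most_common_substring` and the `min_len` parameter are hypothetical. The occurrence counts are filled in while inserting suffixes rather than by a separate depth-first pass, and the traversal then picks the highest count, breaking ties in favor of the longest substring.

```python
def build_suffix_trie(text):
    """Insert every suffix of text; each node counts the suffixes passing through it."""
    root = {"count": 0, "children": {}}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"count": 0, "children": {}})
            # One more suffix starts with the substring spelled by this path,
            # i.e. one more occurrence of that substring.
            node["count"] += 1
    return root


def most_common_substring(text, min_len=1):
    """Return (substring, count): highest count first, then longest substring.

    Only substrings of at least min_len characters are considered.
    """
    root = build_suffix_trie(text)
    best, best_count = "", 0
    stack = [(root, "")]  # iterative depth-first traversal
    while stack:
        node, prefix = stack.pop()
        for ch, child in node["children"].items():
            sub = prefix + ch
            if len(sub) >= min_len:
                count = child["count"]
                if count > best_count or (count == best_count and len(sub) > len(best)):
                    best, best_count = sub, count
            stack.append((child, sub))
    return best, best_count


sample = ("HIYourName=this is not true HIYourName=Have a good day "
          "HIYourName=nope HIYourName=Bye!")
print(most_common_substring(sample, min_len=6))
```

Without a minimum length the winner is always a single character, which is why the sketch takes `min_len`. Overlapping occurrences are counted separately.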

Here is a simple brute-force solution:

from collections import Counter

s = " HIYourName=this is not true HIYourName=Have a good day HIYourName=nope HIYourName=Bye!"
for n in range(1, len(s)):
    # Count every substring of length n (the +1 includes the final one)
    substr_counter = Counter(s[i: i+n] for i in range(len(s) - n + 1))
    phrase, count = substr_counter.most_common(1)[0]
    if count == 1:      # early out: no substring of this length repeats
        break
    print('Size: %3d:  Occurrences: %3d  Phrase: %r' % (n, count, phrase))

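The output for the sample string is shown below. When several substrings tie for the highest count, which one is printed can vary with the Python version: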

Size:   1:  Occurrences:  10  Phrase: ' '
Size:   2:  Occurrences:   4  Phrase: 'Na'
Size:   3:  Occurrences:   4  Phrase: 'Nam'
Size:   4:  Occurrences:   4  Phrase: 'ourN'
Size:   5:  Occurrences:   4  Phrase: 'HIYou'
Size:   6:  Occurrences:   4  Phrase: 'IYourN'
Size:   7:  Occurrences:   4  Phrase: 'urName='
Size:   8:  Occurrences:   4  Phrase: ' HIYourN'
Size:   9:  Occurrences:   4  Phrase: 'HIYourNam'
Size:  10:  Occurrences:   4  Phrase: ' HIYourNam'
Size:  11:  Occurrences:   4  Phrase: ' HIYourName'
Size:  12:  Occurrences:   4  Phrase: ' HIYourName='
Size:  13:  Occurrences:   2  Phrase: 'e HIYourName='

Another brute-force approach, without imports:

s = """ HIYourName=this is not true HIYourName=Have a good day HIYourName=nope HIYourName=Bye!"""

def conseq_sequences(li):
    seq = []
    # a candidate cannot span a space, so the longest space-free token bounds its length
    maxi = max(li.split(), key=len)
    for n in range(2, len(maxi) + 1):  # every substring length from 2 up to that bound
        # zip the string against its shifted copies to get all windows of length n
        seq += ["".join(x) for x in zip(*(li[k:] for k in range(n))) if " " not in x]
    # longest substring that appears more than once
    return max([x for x in seq if seq.count(x) > 1], key=len)

print(conseq_sequences(s))  # prints: HIYourName=

This problem is exactly the same as our case.

The solution is to use a suffix array or a suffix tree together with RMQ (range minimum query).
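A minimal sketch of the suffix array route, with several assumptions: the suffix array is built by naively sorting suffixes (simple, but not linear time), the LCP array uses Kasai's algorithm, and the range minimum queries are answered with a sliding-window minimum, which is enough here because every query window has the same width. The function names and the parameter `k` (the minimum number of occurrences) are hypothetical. The idea is that `k` occurrences of a substring show up as `k` adjacent suffixes in sorted order, and their common prefix length is the minimum of the `k - 1` LCP values between them.

```python
from collections import deque


def suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix text.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # Kasai's algorithm: lcp[r] is the length of the common prefix of the
    # suffixes at sa[r - 1] and sa[r]; lcp[0] is 0 by convention.
    n = len(text)
    rank = [0] * n
    for r, start in enumerate(sa):
        rank[start] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def longest_repeated(text, k=2):
    """Longest substring occurring at least k times (overlaps allowed)."""
    if k < 2:
        raise ValueError("k must be at least 2")
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    width = k - 1                  # k adjacent suffixes span k - 1 LCP entries
    best_len, best_start = 0, 0
    window = deque()               # indices into lcp, their values increasing
    for r in range(1, len(lcp)):
        while window and lcp[window[-1]] >= lcp[r]:
            window.pop()
        window.append(r)
        if window[0] <= r - width:  # front index fell out of the window
            window.popleft()
        if r >= width:              # window lcp[r - width + 1 .. r] is complete
            shared = lcp[window[0]]
            if shared > best_len:
                best_len, best_start = shared, sa[r]
    return text[best_start:best_start + best_len]


sample = ("HIYourName=this is not true HIYourName=Have a good day "
          "HIYourName=nope HIYourName=Bye!")
print(longest_repeated(sample, k=4))
```

For arbitrary, varying query ranges a sparse table or segment tree over the LCP array would take the place of the deque.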

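Aha! Yes, the substring should be longer than some minimum length, say at least 6 characters. Well, now we have a programming problem on our hands. But note that whatever minimum length you give, the returned substring will be exactly that length, or one of the returned substrings will be. You should plan for that.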
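The point about the minimum length follows from the fact that a longer substring can never occur more often than its own prefixes, so the highest count is always reached at exactly the minimum length, usually by several substrings at once. A small sketch, assuming a hypothetical helper `most_common_of_length` and a sample string without the leading whitespace:

```python
from collections import Counter


def most_common_of_length(text, length):
    """Return (count, substrings): the top count among substrings of exactly
    `length` characters, and every substring that reaches it."""
    counts = Counter(text[i:i + length] for i in range(len(text) - length + 1))
    top = max(counts.values())
    return top, [sub for sub, c in counts.items() if c == top]


sample = ("HIYourName=this is not true HIYourName=Have a good day "
          "HIYourName=nope HIYourName=Bye!")
top, winners = most_common_of_length(sample, 6)
print(top, winners)
# 4 ['HIYour', 'IYourN', 'YourNa', 'ourNam', 'urName', 'rName=']
```

All six winners are pieces of HIYourName=, so a practical solution still has to extend or merge such ties to recover the full pattern.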