How to count (overlapping) occurrences of permutations in a large text in Python 3?


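I have a list of words, and I want to know how many times each permutation occurs in this word list. I also want to count overlapping occurrences, so count() does not seem suitable. For example, the permutation aba appears twice in this string: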

ababa

But count() would report only one.
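To make the difference concrete, here is a minimal sketch comparing str.count with a position-by-position check. The variable names are only illustrative and are not part of the original script.

```python
text = "ababa"
sub = "aba"

# str.count only counts non-overlapping matches: after matching "aba" at
# index 0 it resumes at index 3, so the match starting at index 2 is missed.
non_overlapping = text.count(sub)

# Testing every start position counts overlapping matches as well.
overlapping = sum(text.startswith(sub, i) for i in range(len(text)))

print(non_overlapping, overlapping)
```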

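So I wrote this small script, but I am not sure whether it is efficient. The word array comes from an external file; I just removed that part to keep things simple.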

import itertools



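# Occurrence counting function (overlapping matches are counted too)
def occ(string, sub):
    count = start = 0
    while True:
        # find() returns -1 when nothing more is found, which makes start 0
        start = string.find(sub, start) + 1
        if start > 0:
            count += 1
        else:
            return count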


#permutation generator
abc="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
permut = [''.join(p) for p in itertools.product(abc,repeat=2)]


#Transform osd7 in array

arrayofWords=['word1',"word2","word3","word4"]


# the dict has to exist before keys can be assigned to it
dict_output = {'total': 0}

#create the array
for perm in permut:
    dict_output[perm]=0

#iterate over the arrayofWords and permutation
for word in arrayofWords:
    for perm in permut:
        n = occ(word, perm)  # call occ once and reuse the result
        dict_output[perm] += n
        dict_output['total'] += n

It works, but it takes a long time. If I change product(abc, repeat=2) to product(abc, repeat=3) or product(abc, repeat=4), it would take a whole week.
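A rough way to see why the running time explodes: the script scans every word at least once per candidate string, and the number of candidates grows as 26 to the power of repeat. The helper name candidate_count below is hypothetical and only illustrates that growth.

```python
abc = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def candidate_count(k):
    """Number of strings generated by itertools.product(abc, repeat=k)."""
    return len(abc) ** k

# Each candidate triggers a separate scan of every word in the list.
for k in (2, 3, 4):
    print(k, candidate_count(k))
```

So moving from repeat=2 to repeat=4 multiplies the number of scans per word by 26 * 26 = 676, regardless of how many of those candidates ever occur in the text.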

Question: is there a more efficient way to do this?

You can use the re module to count overlapping matches:

import re
print(len(re.findall(r'(?=(aba))', 'ababa')))

Output:

2

More generally:

print(len(re.findall(r'(?=(<pattern>))', '<input_string>')))
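The lookahead approach can be wrapped in a small helper. This is only a sketch: the name count_overlapping is made up for illustration, and it assumes the substring should be matched literally, which is why re.escape is applied.

```python
import re

def count_overlapping(text, sub):
    """Count occurrences of the literal string sub in text, overlaps included."""
    if not sub:
        return 0  # an empty pattern would match at every position
    # (?=...) is a lookahead: it matches without consuming characters,
    # so the search resumes at the very next position after each hit.
    pattern = "(?=" + re.escape(sub) + ")"
    return sum(1 for _ in re.finditer(pattern, text))

print(count_overlapping("ababa", "aba"))
```

Using re.finditer instead of re.findall avoids building a list of matches when only the count is needed.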


Very simple: only count what actually needs to be counted.

from collections import defaultdict

quadrigrams = defaultdict(int)  # missing keys start at 0
for word in arrayofWords:
    for i in range(len(word) - 3):
        quadrigrams[word[i:i+4]] += 1
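The same idea can be generalised to any substring length. The sketch below is one possible variant, not the answerer's code: the function name count_ngrams and the sample word list are assumptions, and collections.Counter is swapped in for the defaultdict.

```python
from collections import Counter

def count_ngrams(words, n):
    """Count every length-n substring of every word, overlaps included."""
    counts = Counter()
    for word in words:
        # Words shorter than n give an empty range and contribute nothing.
        for i in range(len(word) - n + 1):
            counts[word[i:i + n]] += 1
    return counts

words = ["ABABA", "BABAB"]  # hypothetical stand-in for the external word file
trigrams = count_ngrams(words, 3)
total = sum(trigrams.values())
print(trigrams, total)
```

Each word is scanned once, so the work depends on the length of the text rather than on the number of possible permutations. A Counter returns 0 for keys it has never seen, so absent permutations need no explicit entry.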

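So what is your question now? Is it not working?

No, it works fine. I was just wondering whether there is a more efficient way to compute this.

You are also counting letter combinations that are not real words, right? How about inverting the logic: only collect the substrings of len=3 that actually occur, and treat everything else as 0? The way you have it, your dict will be incredibly sparse. For example, take 'ababa'. The existing substrings of len=3 are ['aba', 'bab']. All the others (26!/(23!*3!) - 2 = 2598) do not exist. So you can have a dict like occ = {'aba': 2, 'bab': 1}, and everything else returns 0 through a check such as "if key not in occ: return 0".

You mean I could split each word into smaller pieces of length 3? For example: ababa => aba-bab-aba?

This can replace the occ function and save some time, but to see a real difference I would suggest changing the logic altogether.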
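The sparse dictionary described in the comments could look like the following sketch. The names substring_counts and lookup are hypothetical; dict.get with a default plays the role of the "if key not in occ: return 0" check.

```python
def substring_counts(text, n):
    """Map each length-n substring that occurs in text to its count."""
    occ = {}
    for i in range(len(text) - n + 1):
        piece = text[i:i + n]
        occ[piece] = occ.get(piece, 0) + 1
    return occ

occ = substring_counts("ababa", 3)

def lookup(key):
    # Substrings that never occur are not stored and simply report 0.
    return occ.get(key, 0)

print(occ, lookup("aba"), lookup("xyz"))
```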