Fastest way in Python to search a list of strings for superstrings of a given string

I'm working on a project where I need to check a string against a very large list of strings, looking for cases where the string is a substring of one of the list's elements. I originally had this approach:
def isSubstring(subWord, words):
    for superWord in words:
        if superWord.find(subWord) != -1 and len(subWord) != len(superWord):
            return True
    return False

def checkForSubstrings(words):
    words.sort(key=len, reverse=False)
    while len(words) > 1:
        currentWord = words.pop(0)
        if isSubstring(currentWord, words):
            print("%s is a substring of some other string" % currentWord)
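This sorts all the strings by length and, for each word, compares it only to the longer words.
But this method has a flaw: words are still compared to words of the same length that happen to be placed after them by the sort.
So I changed the checkForSubstring method: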
def checkForSubstring(words):
    sameLengthWordsLists = [[w for w in words if len(w) == num] for num in set(len(i) for i in words)]
    for wordList in sameLengthWordsLists:
        words = words[len(wordList):]
        if len(words) == 0:
            break
        for currentWord in wordList:
            if isSubstring(currentWord, words):
                print("%s is a substring of some other string" % currentWord)
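Instead of sorting by length, this version splits the list of strings into several lists by length, then checks each list against every list of longer words. This fixes the earlier problem.
But it is not noticeably faster. Can anyone suggest a faster approach? At the moment this is a bottleneck.

If the big list of strings is not too large, you can build one big dict containing every possible contiguous substring. Because dict lookups are O(1) on average, each subsequent search drops to O(1), which may speed things up considerably. Here is my example code: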
# -*- coding: utf-8 -*-
import sys
from collections import defaultdict

text = """Sort all the strings by length, for each word, compare it only to the longer words.
But this method has a flaw in that words are still being compared to words of the same length which are arbitrarily placed after it during the list sort.
So I changed the "checkForSubstring" method:"""

def checkForSubstrings(words):
    # Build a big dict first; this may be a little slow and consume a lot of memory
    d = defaultdict(set)
    for windex, word in enumerate(words):
        # Get all possible non-empty substrings of word
        for i in range(len(word)):
            for j in range(i, len(word)):
                sub = word[i:j + 1]
                # Put (word_index, matches_whole) into our dict
                d[sub].add((windex, sub == word))
    # You may call sys.getsizeof(d) to check memory usage
    # (it reports only the dict object itself, not its keys and values)
    # print(sys.getsizeof(d))
    # Iterate over words, find matches but ignore the word itself
    for windex, word in enumerate(words):
        matches = d.get(word, [])
        for obj in matches:
            if not obj[1]:
                print("%s is a substring of some other string" % word)
                break

if __name__ == '__main__':
    words = text.lower().split()
    checkForSubstrings(words)
Output of this script:
sort is a substring of some other string
for is a substring of some other string
compare is a substring of some other string
it is a substring of some other string
method is a substring of some other string
a is a substring of some other string
in is a substring of some other string
words is a substring of some other string
are is a substring of some other string
words is a substring of some other string
length is a substring of some other string
are is a substring of some other string
it is a substring of some other string
so is a substring of some other string
i is a substring of some other string
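
If only a yes/no answer per word is needed, the same idea can be sketched with a plain set instead of a dict of tuples. This is a minimal variant under that assumption, not the answer's original code; the name find_substring_words is made up for illustration. It stores only proper substrings (never the whole word), so a word cannot match itself, and memory still grows roughly quadratically with word length.

```python
def find_substring_words(words):
    """Return the words that are a proper substring of some word in the list."""
    proper_substrings = set()
    for word in words:
        n = len(word)
        for i in range(n):
            for j in range(i + 1, n + 1):
                # Skip the full-length slice so a word never matches itself
                if j - i < n:
                    proper_substrings.add(word[i:j])
    # Preserve input order (and duplicates) in the result
    return [w for w in words if w in proper_substrings]


print(find_substring_words(["foo", "foobar", "bar", "baz"]))
# prints ['foo', 'bar']
```

As in the dict version, exact duplicates do not count as substrings of each other.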

My suggestion is as follows:
from collections import defaultdict

def checkForSubstrings(words):
    # e.g. fo: [foo, foobar]
    super_strings = defaultdict(list)
    # e.g. foo: [fo, oo]
    substrings = defaultdict(list)
    words.sort(key=len, reverse=True)
    while words:
        # Note: pop(0) is highly inefficient, as it shifts the whole list
        word = words.pop()
        subwords = substrings[word]
        # Find the smallest list of words that contain a known substring of `word`
        # (default=None covers the case where no substring of `word` is known yet)
        current_words = min((super_strings[w] for w in subwords), key=len, default=None)
        if not current_words:
            current_words = words
        super_words = [w for w in current_words if len(w) > len(word) and w.find(word) > -1]
        for s in super_words:
            substrings[s].append(word)
        super_strings[word] = super_words
    # the result is in super_strings
    return super_strings
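If no two words are substrings of each other, or if all of them are, this won't change anything. However, if only some are, it should speed things up somewhat. So should using pop() instead of pop(0).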
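
The pop() remark matters because list.pop(0) shifts every remaining element, while list.pop() removes the last element in constant time. Below is a small sketch of two ways to consume words from shortest to longest without pop(0); the helper names are made up for illustration and are not part of the answer.

```python
from collections import deque


def shortest_first_with_pop(words):
    """Sort longest-first, then pop from the end to get the shortest word each time."""
    remaining = sorted(words, key=len, reverse=True)
    order = []
    while remaining:
        order.append(remaining.pop())  # O(1): removes the last (shortest) element
    return order


def shortest_first_with_deque(words):
    """Sort shortest-first and use deque.popleft(), which is also O(1)."""
    remaining = deque(sorted(words, key=len))
    order = []
    while remaining:
        order.append(remaining.popleft())
    return order


sample = ["foobar", "fo", "foo"]
print(shortest_first_with_pop(sample))    # prints ['fo', 'foo', 'foobar']
print(shortest_first_with_deque(sample))  # prints ['fo', 'foo', 'foobar']
```

Words of equal length may come out in a different relative order in the two variants, which does not matter here because same-length words are never compared.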
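
Have you taken a look?
What exactly is your input? A list of words?
Here is an idea: if A is a substring of B, and C is the set of strings such that A is a substring of every element of C (so B is in C), then all superstrings of B are in C. (Example: A=foo, B=foobar, C=[foobar, foobarbaz, foobaz].) So, by going through the words in ascending order, you can first check whether any earlier string was found to be a substring of the current one. You then only need to test those.
words = words[len(wordList):] shouldn't work, because you are replacing the variable words, so the indices are wrong. Also, since a set is unordered, the whole thing doesn't work: sameLengthWordsLists does not contain the words in the order in which they appear in the words list.
@schwobaseggl: the input is a list of words.
Interesting. I'm trying to get it to work but can't, because of the line current_words = min(super_strings[w] for w in subwords, key=len). Could you clarify a bit what that line means? For that matter, what is current_words supposed to do? I don't see it used anywhere.
It should be used afterwards, instead of words. Fixing it now.
@EmmetOT Because foo occurs in foobar and foobaz (and only in those), foobar is searched for only among those, not among all words. If several known substrings were found earlier (e.g. foo and bar), we look in the smallest of their superstring sets.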
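
The two problems raised in the comments about checkForSubstring (reassigning words while slicing by group size, and iterating over an unordered set of lengths) can be avoided by sorting once and grouping by length. This is a minimal sketch, assuming the goal is to compare each word only against strictly longer words; the function name check_for_substrings is made up for illustration, and the approach is still quadratic in the worst case.

```python
from itertools import groupby


def check_for_substrings(words):
    """Return the words that occur inside a strictly longer word."""
    ordered = sorted(words, key=len)
    found = []
    start = 0
    # groupby on the sorted list yields the length groups in ascending order
    for _, group in groupby(ordered, key=len):
        group = list(group)
        start += len(group)
        longer = ordered[start:]  # every word strictly longer than this group
        for word in group:
            if any(word in candidate for candidate in longer):
                found.append(word)
    return found


print(check_for_substrings(["foobar", "foo", "bar", "baz", "foobaz"]))
# prints ['foo', 'bar', 'baz']
```

Because the list is sorted before grouping, the slice ordered[start:] always lines up with the groups, and the input list itself is left unmodified.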