Python: How can we find the common words between two strings?

python string algorithm data-structures

Suppose we have two strings and need to find the words that appear in both of them:

str1 = "hit hop hat"
str2 = "hot has hit hop"

output = ["hit", "hop"]

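I know the simple approach: split each string, put the words into sets, and take the intersection. My question is how we can optimize for space. What if many of the words share a common prefix?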

Here is one way to solve it: build a simplified trie from the shorter word list, then look up each word of the longer list in that trie:

def create_simplified_trie(words):
    trie = {}
    for word in words:
        curr = trie
        for c in word:
            if c not in curr:
                curr[c] = {}
            curr = curr[c]
        # Mark the end of a word
        curr['#'] = True  
    return trie

str1 = "hit hop hat"
str2 = "hot has hit hop"
words1 = str1.split()
words2 = str2.split()
# Ensure words1 is the smaller length list
if len(words1) > len(words2):
    words1, words2 = words2, words1

words1_trie = create_simplified_trie(words1)

output = []
for word in words2:
    curr = words1_trie
    found_prefix = True
    for c in word:
        if c not in curr:
            found_prefix = False
            break
        curr = curr[c]
    if found_prefix and '#' in curr:
        output.append(word)

print(output)

Output:

['hit', 'hop']
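The script above can also be packaged as a function. The following is only a sketch: the names `build_trie` and `common_words` are made up for illustration, and it assumes that no word contains the marker character `'#'`. One behavioural difference from the script: if the longer string repeats a word, the script appends it once per occurrence, whereas this version removes the end-of-word marker after the first match so each common word is reported once, without keeping a separate "seen" set.

```python
def build_trie(words):
    """Build a nested-dict trie; '#' marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node['#'] = True
    return trie


def common_words(str1, str2):
    """Return words present in both strings, in order of first match."""
    words1, words2 = str1.split(), str2.split()
    # Build the trie from the shorter list to keep it small
    if len(words1) > len(words2):
        words1, words2 = words2, words1
    trie = build_trie(words1)

    result = []
    for word in words2:
        node = trie
        for ch in word:
            node = node.get(ch)
            if node is None:
                break  # no word in the trie starts with this prefix
        else:
            # Remove the marker so a repeated word is reported only once
            if node.pop('#', False):
                result.append(word)
    return result


print(common_words("hit hop hat", "hot has hit hop"))
```

Whether the trie really uses less memory than a set depends on how much prefix sharing there is; with nested Python dicts the per-node overhead is significant, so it is worth measuring on real data.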

A simple approach is to build a set of words from each string and take their intersection, like this:

>>> str1 = "hit hop hat"
>>> str2 = "hot has hit hop"

>>> set_of_words1=set( str1.split() )
>>> set_of_words2=set( str2.split() )

>>> set_of_words1 & set_of_words2
{'hop', 'hit'}
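If space is the concern, one small variation on the set approach is to build a set only from the shorter word list and pass the other list to `set.intersection`, which accepts any iterable. This is a sketch, and the function name `common_words_one_set` is hypothetical:

```python
def common_words_one_set(str1, str2):
    """Intersect two word lists while building a set for only one of them."""
    words1, words2 = str1.split(), str2.split()
    # Keep the set small by building it from the shorter list
    if len(words1) > len(words2):
        words1, words2 = words2, words1
    # intersection() accepts any iterable, so no second set is built by us
    return set(words1).intersection(words2)


print(common_words_one_set("hit hop hat", "hot has hit hop"))
```

The lists produced by `split()` still live in memory, so this only avoids the second set, not the second copy of the words.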

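There is no obvious reason to optimize anything here, but you could implement a trie if you wish.

Can you explain how that would work? Say, build a trie from one string and search it for the words of the other?

Yes, I think that is a decent option. The big question is at what point it actually becomes faster than using sets. As for space, it might make sense if your input is a file larger than the available RAM.

@zone, yes, that makes sense, but I was wondering: what if we built two tries in parallel, one per sentence, and compared them? I don't know how I would compare the two tries, though. Is there a way to compare two tries and find what they have in common?

Thanks for the answer. I'm still a beginner. Does it make sense to build two tries for the two sentences and compare them for common entries?

I think you only need to build a trie for one of the sentences and then check whether each word of the other sentence appears in it (i.e., the word occurs in both sentences).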
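For completeness, comparing two tries is possible: walk both at the same time and follow only the keys they share. The sketch below assumes the same nested-dict layout with `'#'` as the end-of-word marker used in the answer above; `intersect_tries` is a hypothetical helper name, and words are assumed not to contain `'#'`.

```python
def create_simplified_trie(words):
    """Same structure as in the answer: nested dicts, '#' ends a word."""
    trie = {}
    for word in words:
        curr = trie
        for c in word:
            if c not in curr:
                curr[c] = {}
            curr = curr[c]
        curr['#'] = True
    return trie


def intersect_tries(a, b, prefix=''):
    """Yield every word stored in both tries by walking them together."""
    for key in a:
        if key not in b:
            continue  # this branch exists only in trie a
        if key == '#':
            yield prefix  # both tries end a word at this node
        else:
            yield from intersect_tries(a[key], b[key], prefix + key)


trie1 = create_simplified_trie("hit hop hat".split())
trie2 = create_simplified_trie("hot has hit hop".split())
print(list(intersect_tries(trie1, trie2)))
```

This needs memory for both tries, so for the space question it is usually no better than building a single trie and looking up the other sentence's words in it.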