Python: Match phrases in dictionary values against sentences (dictionary keys) and tag the output by match result


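I have a dictionary in which each key is a sentence and each value is a list of specific words or phrases from that sentence.

For example:

dict1 = {'it is lovely weather and it is kind of warm':['lovely weather', 'it is kind of warm'],'and the weather is rainy and cold':['rainy and cold'],'the temperature is ok':['temperature']}

I want to tag each part of every sentence in the output according to whether it is one of the phrases in the dictionary value.

In this example, the output would mark each word or phrase with 0 if it is not in the values and 1 if it is.

I can get something similar by hard-coding the number of words in the phrase: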

for k, v in dict1.items():
    words = k.split()
    for phrase in v:  # v is a list of phrases, so handle each one separately
        words_in_val = phrase.split()
        # Case 1: the phrase is a single word
        if len(words_in_val) == 1:
            for each_word in words:
                if phrase == each_word:
                    print(each_word + '\t' + '1')
                else:
                    print(each_word + '\t' + '0')
        # Case 2: the phrase is exactly two words
        if len(words_in_val) == 2:
            for index, item in enumerate(words[:-1]):
                if words[index] == words_in_val[0]:
                    if words[index + 1] == words_in_val[1]:
                        # Merge the two matching words into one entry
                        words[index] = ' '.join(words_in_val)
                        del words[index + 1]
                        # ....something like this...

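My problem is that I can see this starting to get messy, and in theory the phrases I want to match could contain an unlimited number of words, although it usually is …

So I would do it like this: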

dict1 = {'it is lovely weather and it is kind of warm':['it is kind of', 'it is kind'],'and the weather is rainy and cold':['rainy and cold'],'the temperature is ok':['temperature']}

def tag_sentences(sentence_phrases):
    group_id = 1
    tagged_results = []
    for sentence, phrases in sentence_phrases.items():
        words = sentence.split()
        phrases_split = [phrase.split() for phrase in phrases]
        # positions_keeper[i] = how many leading words of phrase i have matched so far
        positions_keeper = {}
        # Every word starts with tag 0 (not part of any phrase)
        sentence_results = [(word, 0) for word in words]
        for word_index, word in enumerate(words):
            for index, phrase in enumerate(phrases_split):
                position = positions_keeper.get(index, 0)
                if phrase[position] != word:
                    # Mismatch: restart, but let this word begin a new attempt
                    position = 0
                if phrase[position] == word:
                    if len(phrase) > position + 1:
                        positions_keeper[index] = position + 1
                    else:
                        # Last word of the phrase: tag the whole phrase with one id
                        for i in range(len(phrase)):
                            sentence_results[word_index - i] = (sentence_results[word_index - i][0], group_id)
                        group_id = group_id + 1
                        # Start over so the same phrase can match again later
                        positions_keeper[index] = 0
                else:
                    positions_keeper[index] = 0
        tagged_results.append(sentence_results)
    return tagged_results

def print_tagged_results(tagged_results):
    for tagged_result in tagged_results:
        memory = 0  # tag of the previous word
        memory_sentence = ""  # words of the phrase collected so far
        for result, group_id in tagged_result:
            # The tag changed, so the pending phrase is complete: print it
            if memory != 0 and memory != group_id:
                print(memory_sentence + "1")
                memory_sentence = ""
            if group_id == 0:
                print(result, 0)
            else:
                memory_sentence += result + " "
            memory = group_id
        # Print a phrase that runs to the end of the sentence
        if memory != 0:
            print(memory_sentence + "1")

tagged_results = tag_sentences(dict1)
print_tagged_results(tagged_results)

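This is basically doing the following:

First, I create a list of tags in the following format:

[(it, 0), (is, 0), (lovely, 0) ...]

In the tag list, 0 marks a word that is not in any group, and any other integer marks words that are grouped together (words tagged 1 form one group, words tagged 2 form another).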
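To make the tag convention concrete, here is a minimal sketch of how such a tag list could be collapsed into the requested 0/1 output. The helper name `group_tagged_words` is hypothetical and not part of the answer's code; it assumes that consecutive words sharing a non-zero tag belong to one phrase, and it uses `itertools.groupby` instead of the manual bookkeeping in `print_tagged_results`.

```python
from itertools import groupby

def group_tagged_words(tagged_words):
    """Collapse consecutive (word, tag) pairs that share a tag into (text, flag) pairs."""
    grouped = []
    # groupby yields one run per stretch of consecutive pairs with the same tag
    for tag, run in groupby(tagged_words, key=lambda pair: pair[1]):
        run_words = [word for word, _ in run]
        if tag == 0:
            # Untagged words stay separate, each flagged 0
            grouped.extend((word, 0) for word in run_words)
        else:
            # Words sharing a non-zero tag form one phrase, flagged 1
            grouped.append((" ".join(run_words), 1))
    return grouped

tagged = [("it", 0), ("is", 0), ("lovely", 1), ("weather", 1), ("and", 0)]
print(group_tagged_words(tagged))
# [('it', 0), ('is', 0), ('lovely weather', 1), ('and', 0)]
```

Because the grouping key is the tag itself, two adjacent phrases with different ids (say 1 and 2) stay separate, which is the reason for using distinct integers rather than a plain 1.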

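I iterate over each word, and if it matches the start of a phrase, or if I am already partway through matching a phrase and it matches the word at the current position, I record it.

If it is the end of the phrase, I tag that word and all the preceding words that matched the phrase with the same id.

If it is not the end, I keep the position and move on to the next iteration.

In the end, I have a list of tags in the format

[(it, 0), (is, 0), (lovely, 1) ... (kind, 2), (of, 2), ...]

It will not work if one phrase is a sub-phrase of another, but your example never mentions how that case should be handled.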
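The question does not say how sub-phrases should be resolved, so the following is only one possible policy, not the answer's method: at each position, try the longest phrase first. The function name `tag_longest_first` and the "longest wins" rule are assumptions made for illustration.

```python
def tag_longest_first(sentence, phrases):
    """Return (text, flag) pairs; at each position the longest matching phrase wins."""
    words = sentence.split()
    # Try longer phrases first so 'it is kind of' beats its sub-phrase 'it is kind'
    candidates = sorted((p.split() for p in phrases), key=len, reverse=True)
    result = []
    i = 0
    while i < len(words):
        for phrase_words in candidates:
            n = len(phrase_words)
            if n and words[i:i + n] == phrase_words:
                result.append((" ".join(phrase_words), 1))
                i += n  # skip past the whole matched phrase
                break
        else:
            # No phrase starts at this word
            result.append((words[i], 0))
            i += 1
    return result

for text, flag in tag_longest_first(
    'it is lovely weather and it is kind of warm',
    ['it is kind of', 'it is kind'],
):
    print(text, flag)
```

With this policy the shorter phrase 'it is kind' is never reported where the longer one also matches; a different requirement (for example, reporting both) would need a different rule.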


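Was this question closed because it is too vague?

Thanks, I didn't realise this would be so difficult; I thought I was just struggling to turn the code above into a loop.

"A small group of words standing together as a conceptual unit, typically forming a component of a clause." You would have to make the program understand what a "conceptual unit" is, and I think that is hard.

Hmm, I don't think so. Doesn't he already have all the phrases and words in the dictionary? So it is basically one big lookup. The problem is that he handles every length separately, which could be solved given some time (I am only thinking this through in my head), but I don't see any need for the program to understand phrases.

本达尔, that is closer to what I had in mind. It is more like changing the loop to something like "split each phrase in the values into words, then, based on the number of words in that phrase, split the key/sentence into overlapping chunks of the same length and check whether they are equal" (I only know how to do this in theory and am not sure how to do it in practice).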
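The "overlapping chunks" idea from the last comment can be sketched in a few lines. This is an illustration of the commenter's description, not code from the thread; the function name `find_phrase_starts` is hypothetical.

```python
def find_phrase_starts(sentence, phrase):
    """Return the word indices at which `phrase` starts inside `sentence`."""
    words = sentence.split()
    phrase_words = phrase.split()
    n = len(phrase_words)
    if n == 0:
        return []
    # Slide a window of n words over the sentence and compare each chunk
    return [
        start
        for start in range(len(words) - n + 1)
        if words[start:start + n] == phrase_words
    ]

print(find_phrase_starts('and the weather is rainy and cold', 'rainy and cold'))
# [4]
print(find_phrase_starts('it is lovely weather and it is kind of warm', 'it is'))
# [0, 5]
```

Since the window length is taken from the phrase itself, the same loop handles phrases of any length, which removes the need for a separate branch per word count.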