Python: return a list of matches for a given phrase

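I am trying to write a function that checks whether a given phrase matches at least one item in a list of phrases, and returns the matching items. The inputs are the phrase, the list of phrases, and a dictionary of synonym lists. The point is to make it general.

Here is an example: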

phrase = 'This is a little house'
dictSyns = {'little':['small','tiny','little'],
            'house':['cottage','house']}
listPhrases = ['This is a tiny house','This is a small cottage','This is a small building','I need advice']

I can write code that works for this particular example and returns a bool:

if any('This is a ' + x + ' ' + y == phrase for x in dictSyns['little'] for y in dictSyns['house']):
    print('match')

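The first issue is that I have to make the function general (depending on the results). The second is that I want the function to return a list of the matching phrases.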

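Can you give me some advice? In this case, the method should return

['This is a tiny house','This is a small cottage']

The output would look like this: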

>>> getMatches(phrase, dictSyns, listPhrases)
['This is a tiny house','This is a small cottage']

I would approach this as follows:

import itertools

def new_phrases(phrase, syns):
    """Generate new phrases from a base phrase and synonyms."""
    words = [syns.get(word, [word]) for word in phrase.split(' ')]
    for t in itertools.product(*words):
        yield ' '.join(t)

def get_matches(phrase, syns, phrases):
    """Generate acceptable new phrases based on a whitelist."""
    phrases = set(phrases)
    for new_phrase in new_phrases(phrase, syns):
        if new_phrase in phrases:
            yield new_phrase

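The core of the code is the assignment of words in new_phrases, which transforms phrase and syns into a more usable form: a list in which each element is the list of acceptable choices for that word: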
>>> [syns.get(word, [word]) for word in phrase.split(' ')]
[['This'], ['is'], ['a'], ['small', 'tiny', 'little'], ['cottage', 'house']]

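Note the following:

Use of generators to deal with large numbers of combinations more efficiently (rather than building the whole list at once);
Use of a set for efficient membership testing (O(1), compared with O(n) for a list);
Use of itertools.product to generate the possible combinations of phrases based on syns (there are other tools you could use when implementing this); and
Adherence to standard style conventions.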
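To make the generator and itertools.product points concrete, here is a small sketch. It assumes the per-word option lists shown earlier and uses itertools.islice (not part of the answer above) purely to demonstrate that combinations are produced on demand.

```python
import itertools

# Per-word options, as produced by the list comprehension in new_phrases.
words = [['This'], ['is'], ['a'],
         ['small', 'tiny', 'little'],
         ['cottage', 'house']]

# product() is lazy: no combination is built until we iterate.
combos = itertools.product(*words)

# Take only the first two combinations without generating the rest.
first_two = [' '.join(t) for t in itertools.islice(combos, 2)]
print(first_two)

# The number of candidate phrases is the product of the option counts,
# which is why building the whole list up front can get expensive.
total = 1
for options in words:
    total *= len(options)
print(total)
```

The candidate count grows multiplicatively with every word that has synonyms, so lazy generation matters more as phrases get longer.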

In use:
>>> list(get_matches(phrase, syns, phrases))
['This is a small cottage', 'This is a tiny house']
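For completeness, here is a self-contained sketch that runs the two functions against the data from the question. The getMatches wrapper is a hypothetical convenience, not part of the answer: it assumes you want a list rather than a generator, with results in the same order as listPhrases.

```python
import itertools

def new_phrases(phrase, syns):
    """Generate new phrases from a base phrase and synonyms."""
    words = [syns.get(word, [word]) for word in phrase.split(' ')]
    for t in itertools.product(*words):
        yield ' '.join(t)

def get_matches(phrase, syns, phrases):
    """Generate acceptable new phrases based on a whitelist."""
    phrases = set(phrases)
    for new_phrase in new_phrases(phrase, syns):
        if new_phrase in phrases:
            yield new_phrase

def getMatches(phrase, dictSyns, listPhrases):
    """Hypothetical wrapper: return matches as a list, in listPhrases order."""
    found = set(get_matches(phrase, dictSyns, listPhrases))
    return [p for p in listPhrases if p in found]

phrase = 'This is a little house'
dictSyns = {'little': ['small', 'tiny', 'little'],
            'house': ['cottage', 'house']}
listPhrases = ['This is a tiny house', 'This is a small cottage',
               'This is a small building', 'I need advice']

result = getMatches(phrase, dictSyns, listPhrases)
print(result)
```

The generator yields matches in the order the combinations are produced, so the wrapper re-filters listPhrases to restore the input order.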

Things to consider:

What about character case (for example, how should "House" be treated)?
What about punctuation?

This is how I did it:
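# Build a flat list of acceptable words: the phrase's own words plus every synonym.
acceptable = phrase.split()
for key in dictSyns:
    acceptable += dictSyns[key]

for each_phrase in listPhrases:
    # Print the phrase only if every one of its words is acceptable.
    if all(word in acceptable for word in each_phrase.split()):
        print(each_phrase)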

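This is probably not very efficient. It builds a list of acceptable words, then compares every word in each string against that list, and prints the phrase if it contains no unacceptable words.
Edit: I also realize that this does not check whether the phrase makes grammatical sense. For example, the phrase "little this a" would still be returned as a match. It only checks each word individually. I am leaving this here as a display of my shame.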
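One way to close that gap, sketched here as an assumption rather than as anything proposed in the thread, is to compare words position by position, so that word order and phrase length both have to agree. The function name matches_by_position is hypothetical.

```python
def matches_by_position(phrase, syns, candidate):
    """Return True if candidate equals phrase word for word, up to synonyms."""
    base_words = phrase.split()
    cand_words = candidate.split()
    # Different lengths can never match position by position.
    if len(base_words) != len(cand_words):
        return False
    # Each candidate word must be an accepted option for the word
    # at the same position in the base phrase.
    return all(
        cand in syns.get(base, [base])
        for base, cand in zip(base_words, cand_words)
    )

phrase = 'This is a little house'
dictSyns = {'little': ['small', 'tiny', 'little'],
            'house': ['cottage', 'house']}
listPhrases = ['This is a tiny house', 'This is a small cottage',
               'This is a small building', 'I need advice']

result = [p for p in listPhrases if matches_by_position(phrase, dictSyns, p)]
print(result)
```

This does a single pass per candidate phrase and never enumerates the synonym combinations, at the cost of checking every entry in listPhrases.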
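Thank you, this helped me a lot. Very nice approach. Case: I would change this line to: if new_phrase.lower() in [x.lower() for x in phrases]. Punctuation (comma and period): I would use .strip(',').strip('.')

@Milan Note that your lowercase approach is very inefficient, because it reprocesses phrases for every new_phrase, does not use a set, and does not lowercase the words when generating new_phrase. You will also have to think carefully about the step at which to strip (note that you can just use strip(',.')).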
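A minimal sketch of how the advice in these comments might be combined: normalize everything once, keep a dictionary (hash-based, like a set) for lookups, and lowercase the synonyms before generating candidates. The names normalize and get_matches_normalized are hypothetical, and stripping ',.' from each word only handles punctuation at word boundaries.

```python
import itertools

def normalize(text):
    """Lowercase each word and strip leading/trailing commas and periods."""
    words = (w.strip(',.').lower() for w in text.split())
    return ' '.join(w for w in words if w)

def get_matches_normalized(phrase, syns, phrases):
    """Yield original phrases that match after case/punctuation normalization."""
    # Normalize the whitelist once, remembering the original spelling.
    lookup = {}
    for p in phrases:
        lookup.setdefault(normalize(p), p)
    # Lowercase the synonym dictionary so candidates are generated in lowercase.
    syns_lower = {k.lower(): [s.lower() for s in v] for k, v in syns.items()}
    words = [syns_lower.get(w, [w]) for w in normalize(phrase).split()]
    for t in itertools.product(*words):
        candidate = ' '.join(t)
        if candidate in lookup:
            yield lookup[candidate]

dictSyns = {'little': ['small', 'tiny', 'little'],
            'house': ['cottage', 'house']}
messy = ['this is a TINY house', 'This is a small cottage,',
         'This is a small building', 'I need advice']

result = list(get_matches_normalized('This is a little House.', dictSyns, messy))
print(result)
```

Each whitelist phrase is normalized exactly once, so the per-candidate cost stays at a single hash lookup.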