Applying collocations from a list of bigrams with NLTK in Python


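I need to find collocations in several sentences and apply them. The sentences are stored in a list of strings. For now, let's focus on a single sentence. Here is an example: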

sentence = 'I like to eat the ice cream in new york'

This is what I want to end up with:

sentence_final = 'I like to eat the ice_cream in new_york'

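I use Python NLTK to find the collocations, and I am able to build a set containing all possible collocations across all sentences. Here is a sample of that set: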

set_collocations = set([('ice', 'cream'), ('new', 'york'), ('go', 'out')])

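In practice, the set is obviously much larger.

I wrote the following function, which should return the new sentence, modified as described above: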

import nltk

def apply_collocations(sentence, set_colloc):
    window_size = 2
    words = sentence.lower().split()
    list_bigrams = list(nltk.bigrams(words))
    set_bigrams=set(list_bigrams)
    intersect = set_bigrams.intersection(set_colloc)
    print(set_colloc)
    print(set_bigrams)
    #  No collocation in this sentence
    if not intersect:
        return sentence
    #  At least one collocation in this sentence
    else:
        set_words_iters = set()
        # Create set of words of the collocations
        for bigram in intersect:
            set_words_iters.add(bigram[0])
            set_words_iters.add(bigram[1])
        # Sentence beginning
        if list_bigrams[0][0] not in set_words_iters:
            new_sentence = list_bigrams[0][0]
            begin = 1
        else:
            new_sentence = list_bigrams[0][0] + '_' + list_bigrams[0][1]
            begin = 2

        for i in range(begin, len(list_bigrams)):
            print(new_sentence)
            if list_bigrams[i][1] in set_words_iters and list_bigrams[i] in intersect:
                new_sentence += ' ' + list_bigrams[i][0] + '_' + list_bigrams[i][1]
            elif list_bigrams[i][1] not in set_words_iters:
                new_sentence += ' ' + list_bigrams[i][1]
        return new_sentence

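Question 2:

Is there a more efficient way to do this? Since I am fairly new to NLTK, can someone tell me whether there is a direct way to apply collocations to a text? That is, once I have identified the bigrams I consider collocations, is there a function or a quick way to modify my sentences accordingly?

For each element in the collocation set, simply replace the string "x y" with "x_y":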

def apply_collocations(sentence, set_colloc):
    res = sentence.lower()
    for b1,b2 in set_colloc:
        res = res.replace("%s %s" % (b1 ,b2), "%s_%s" % (b1 ,b2))
    return res
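
One caveat with plain string replacement is that it matches substrings, so "nice cream" would become "nice_cream", and the whole sentence is lowercased along the way. A minimal sketch of a token-based alternative follows, assuming whitespace tokenization is good enough and that the collocations in the set are lowercase; the name merge_collocations and the sep parameter are hypothetical, chosen only for this illustration:

```python
def merge_collocations(sentence, set_colloc, sep="_"):
    """Join adjacent word pairs found in set_colloc with `sep`.

    Hypothetical helper: assumes whitespace tokenization and that
    set_colloc holds lowercase (word1, word2) tuples.
    """
    words = sentence.split()
    merged = []
    i = 0
    while i < len(words):
        # Compare in lowercase so 'New York' still matches ('new', 'york'),
        # while the original casing is kept in the output.
        if i + 1 < len(words) and (words[i].lower(), words[i + 1].lower()) in set_colloc:
            merged.append(words[i] + sep + words[i + 1])
            i += 2  # skip the second word of the merged pair
        else:
            merged.append(words[i])
            i += 1
    return " ".join(merged)


set_collocations = {("ice", "cream"), ("new", "york"), ("go", "out")}

print(merge_collocations("I like to eat the ice cream in new york", set_collocations))
# I like to eat the ice_cream in new_york

print(merge_collocations("a nice cream in new york", set_collocations))
# a nice cream in new_york
```

Each pair is checked with one set lookup, so the cost grows with the length of the sentence rather than with the size of the collocation set. NLTK also ships nltk.tokenize.MWETokenizer, which takes a list of multi-word expressions and a separator argument and merges them in a token list; it may be a more direct fit for the question.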