Python NLTK: creating bigrams with sentence boundaries

I am trying to use nltk to create bigrams that do not cross sentence boundaries. I tried using from_documents, but it does not work the way I hoped.

import nltk
from nltk.collocations import BigramCollocationFinder

bigram_measures = nltk.collocations.BigramAssocMeasures()

# Each inner list is one tokenized sentence
finder = BigramCollocationFinder.from_documents([['This', 'is', 'sentence', 'one'], ['A', 'second', 'sentence']])
print(finder.nbest(bigram_measures.pmi, 10))

>> [(u'A', u'second'), (u'This', u'is'), (u'one', u'A'), (u'is', u'sentence'), (u'second', u'sentence'), (u'sentence', u'one')]

The result includes (u'one', u'A'), which spans the two sentences and is exactly what I am trying to avoid.

I eventually gave up on nltk and did the processing by hand:

To create the n-grams, I used a helper function, find_ngrams.
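
The helper's definition is not shown in this answer. A minimal sketch of what such a function could look like, assuming it takes one tokenized sentence and n and returns the n-grams as a list of tuples (the body below is an assumption, not necessarily the function that was actually used):

```python
def find_ngrams(tokens, n):
    """Return the n-grams of one token list as a list of tuples.

    Assumed implementation: the original helper is not shown.
    """
    # zip over n shifted copies of the list; zip stops at the shortest one,
    # so no n-gram runs past the end of this sentence
    return list(zip(*[tokens[i:] for i in range(n)]))


text = [['This', 'is', 'sentence', 'one'], ['A', 'second', 'sentence']]
for sentence in text:
    print(find_ngrams(sentence, 2))
```

Because each sentence is passed in separately, a pair such as ('one', 'A') is never produced.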

Building on this, I computed the bigram probabilities as follows:

First, I created the bigrams

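# Flatten so that all_bigrams is one list of tuples, not a list of per-sentence lists
all_bigrams = [bigram for sentence in text for bigram in find_ngrams(sentence, 2)]
Then I grouped them by their first word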

first_words = {}
for bigram in all_bigrams:
    # Key each bigram by its first word
    if bigram[0] in first_words:
        first_words[bigram[0]].append(bigram)
    else:
        first_words[bigram[0]] = [bigram]
Then I computed the probability of each bigram

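bi_probabilites = {}
for bigram in set(all_bigrams):
    # Count how often this bigram occurs among those sharing its first word
    bigram_count = 0
    first_word_list = first_words[bigram[0]]
    for item in first_word_list:
        if item == bigram:
            bigram_count += 1
    bi_probabilites[bigram] = {
        'count': bigram_count,
        'length': len(first_word_list),
        # Relative frequency of the second word given the first
        'prob': float(bigram_count)/len(first_word_list)
    }
It is not the most elegant solution, but it gets the job done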
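
For comparison, the same counts can be gathered in a single pass with collections.Counter. This is only a sketch under the assumption that the input is a list of tokenized sentences; the function name bigram_probabilities is made up for illustration, and the result keeps the same 'count', 'length' and 'prob' keys as above:

```python
from collections import Counter


def bigram_probabilities(sentences):
    """Estimate P(second | first) from per-sentence bigram counts."""
    bigram_counts = Counter()
    first_word_counts = Counter()
    for sentence in sentences:
        # Pair each token with its successor inside this sentence only
        for bigram in zip(sentence, sentence[1:]):
            bigram_counts[bigram] += 1
            first_word_counts[bigram[0]] += 1
    return {
        bigram: {
            'count': count,
            'length': first_word_counts[bigram[0]],
            'prob': count / float(first_word_counts[bigram[0]]),
        }
        for bigram, count in bigram_counts.items()
    }


text = [['This', 'is', 'sentence', 'one'], ['A', 'second', 'sentence']]
probs = bigram_probabilities(text)
print(probs[('This', 'is')])
```

Here 'length' is the number of bigrams that start with the same first word, so 'prob' is the relative frequency of the second word given the first.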
