Python: How do I perform Kneser-Ney smoothing at the word level in NLTK for a trigram language model?

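I am trying to train a trigram language model on a text corpus and want to apply Kneser-Ney (KN) smoothing. Apparently, "nltk.trigrams" is implemented at the character level. I would like to know how to do this at the word level and apply KN smoothing. Below is the code I wrote, but it does not work: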

    with open('file.txt',"r",encoding = "ISO-8859-1") as ff:
        text = ff.read()

    word_tok = tknzr.tokenize(text)
    ngrams = nltk.trigrams(word_tok)
    freq_dist = nltk.FreqDist(ngrams)
    kneser_ney = nltk.KneserNeyProbDist(freq_dist)
    print(kneser_ney.prob('you go to'))

I get this error:

    Expected an iterable with 3 members.
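
Replace the line:

    print(kneser_ney.prob('you go to'))

with:

    print(kneser_ney.prob('you go to'.split()))

and it works: prob expects a trigram as a sequence of three word tokens, whereas the string 'you go to' is a sequence of nine characters. Using the text of the novel Moby Dick, downloaded from Project Gutenberg, as the training file, I get a value of 0.05217391304347826.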
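The length mismatch behind the error can be sketched in plain Python. Here as_trigram is a hypothetical helper written for this illustration (it is not part of NLTK), and splitting on whitespace is an assumption: in practice the query should be tokenized the same way as the training text.

```python
def as_trigram(query):
    """Hypothetical helper: turn a string or a token sequence into a 3-tuple of words."""
    # Assumption: whitespace splitting matches the tokenizer used for training.
    tokens = query.split() if isinstance(query, str) else list(query)
    if len(tokens) != 3:
        raise ValueError(f"Expected 3 tokens, got {len(tokens)}: {tokens!r}")
    return tuple(tokens)


# A raw string is a sequence of characters, so it has 9 members rather than 3.
string_length = len('you go to')
token_count = len('you go to'.split())

print(string_length, token_count)
print(as_trigram('you go to'))
```

The resulting tuple can then be passed to kneser_ney.prob in place of the raw string.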

With this change, your code looks like this:

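    from nltk import word_tokenize, trigrams
    from nltk import FreqDist, KneserNeyProbDist

    # Read the training corpus (Moby Dick from Project Gutenberg).
    with open('./txts/mobyDick.txt') as ff:
        text = ff.read()

    word_tok = word_tokenize(text)         # split the text into word tokens
    ngrams = trigrams(word_tok)            # generator of word-level 3-tuples
    freq_dist = FreqDist(ngrams)           # count each trigram
    kneser_ney = KneserNeyProbDist(freq_dist)
    # Pass the query as a sequence of three tokens, not as one string.
    print(kneser_ney.prob('you go to'.split()))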

Moreover, everything here operates at the word level rather than the character level:

    ngrams = trigrams(word_tok)
    for _ in range(10):
        print(next(ngrams))

    # ('\ufeff', 'The', 'Project')
    # ('The', 'Project', 'Gutenberg')
    # ('Project', 'Gutenberg', 'EBook')
    # ('Gutenberg', 'EBook', 'of')
    # ('EBook', 'of', 'Moby')
    # ('of', 'Moby', 'Dick')
    # ('Moby', 'Dick', ';')
    # ('Dick', ';', 'or')
    # (';', 'or', 'The')
    # ('or', 'The', 'Whale')
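One plausible reason trigram extraction can appear to work at the character level is that a raw string was passed in instead of a token list: a sliding window treats any sequence the same way, and a string is a sequence of characters. The sketch below illustrates this with sliding_trigrams, a hypothetical stand-in written for this example (it is not NLTK's implementation); the sample sentence is also made up.

```python
def sliding_trigrams(sequence):
    """Hypothetical stand-in: return every run of three consecutive items."""
    items = list(sequence)
    return list(zip(items, items[1:], items[2:]))


text = "you go to sea"

# Passing the raw string slides the window over characters.
char_level = sliding_trigrams(text)

# Passing a list of tokens slides the window over words.
word_level = sliding_trigrams(text.split())

print(char_level[:2])
print(word_level)
```

Whether the result is character-level or word-level depends only on what the input sequence contains, which is why tokenizing first matters.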

The frequency distribution is also at the word level:

    freq_dist.freq(tuple('on the ocean'.split()))
    # 7.710783916846906e-06
    freq_dist.freq(tuple('new Intel CPU'.split()))
    # 0.0
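The values above are relative frequencies: the count of a trigram divided by the total number of trigrams observed. A minimal sketch of that arithmetic follows, using collections.Counter as a stand-in for FreqDist and a made-up seven-word corpus; relative_freq is a hypothetical name introduced for this example.

```python
from collections import Counter

# Toy corpus (an assumption for illustration, not the Moby Dick text).
tokens = "on the ocean and on the shore".split()

# Count every word-level trigram in the toy corpus.
trigram_counts = Counter(zip(tokens, tokens[1:], tokens[2:]))
total = sum(trigram_counts.values())


def relative_freq(trigram):
    """Hypothetical helper: unsmoothed relative frequency of a word trigram."""
    return trigram_counts[tuple(trigram)] / total


seen = relative_freq('on the ocean'.split())
unseen = relative_freq('new Intel CPU'.split())
print(seen, unseen)
```

An unseen trigram gets exactly zero here. Smoothing methods such as Kneser-Ney redistribute some probability mass so that a trigram which was not observed, but whose context was, is not necessarily assigned zero.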

