Python: How to perform Kneser-Ney smoothing in NLTK at the word level for a bigram language model?


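From the nltk package, I see that Kneser-Ney smoothing can only be used with trigrams; when I try the same function on bigrams, it throws an error. Is there a way to apply the smoothing to bigrams?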

## Working code for trigrams
import nltk

tokens = "What a piece of work is man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!".split()
gut_ngrams = nltk.ngrams(tokens,3)
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

First, let's look at the code and the implementation. When we use bigrams:

import nltk

tokens = "What a piece of work is man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!".split()
gut_ngrams = nltk.ngrams(tokens,2)
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

The code throws an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-1ce73b806bb8> in <module>
      4 gut_ngrams = nltk.ngrams(tokens,2)
      5 freq_dist = nltk.FreqDist(gut_ngrams)
----> 6 kneser_ney = nltk.KneserNeyProbDist(freq_dist)

~/.pyenv/versions/3.8.0/lib/python3.8/site-packages/nltk/probability.py in __init__(self, freqdist, bins, discount)
   1737         self._trigrams_contain = defaultdict(float)
   1738         self._wordtypes_before = defaultdict(float)
-> 1739         for w0, w1, w2 in freqdist:
   1740             self._bigrams[(w0, w1)] += freqdist[(w0, w1, w2)]
   1741             self._wordtypes_after[(w0, w1)] += 1

ValueError: not enough values to unpack (expected 3, got 2)

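We see that the initialization makes an assumption when it counts the n-grams before and after the current word: every key in the frequency distribution is unpacked as exactly three words.

for w0, w1, w2 in freqdist:
    self._bigrams[(w0, w1)] += freqdist[(w0, w1, w2)]
    self._wordtypes_after[(w0, w1)] += 1
    self._trigrams_contain[w1] += 1
    self._wordtypes_before[(w1, w2)] += 1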
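
The unpacking failure can be reproduced without NLTK. The following is a minimal sketch, assuming only that iterating over a frequency distribution yields its n-gram tuples as keys (as iterating over a `Counter` does); `unpack_error` is a hypothetical helper written for this illustration, not part of NLTK.

```python
from collections import Counter


def unpack_error(ngram_counts):
    """Unpack every key as a trigram; return the error message, or None."""
    try:
        for w0, w1, w2 in ngram_counts:
            pass
    except ValueError as error:
        return str(error)
    return None


bigram_counts = Counter([("how", "noble"), ("noble", "in")])
trigram_counts = Counter([("how", "noble", "in")])
fourgram_counts = Counter([("how", "noble", "in", "reason!")])

print(unpack_error(bigram_counts))    # complains about too few values
print(unpack_error(trigram_counts))   # None: three values unpack cleanly
print(unpack_error(fourgram_counts))  # complains about too many values
```

Only keys of length three pass through the loop, which matches the two tracebacks shown in this answer.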

In that case, only trigrams work with KN smoothing for the KneserNeyProbDist object.

Let's try with 4-grams. The same constructor call fails in the same loop, this time because each key has too many values to unpack:

tokens = "What a piece of work is man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!".split()
gut_ngrams = nltk.ngrams(tokens,4)
freq_dist = nltk.FreqDist(gut_ngrams)
kneser_ney = nltk.KneserNeyProbDist(freq_dist)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-60a48ed2ffce> in <module>
      4 gut_ngrams = nltk.ngrams(tokens,4)
      5 freq_dist = nltk.FreqDist(gut_ngrams)
----> 6 kneser_ney = nltk.KneserNeyProbDist(freq_dist)

~/.pyenv/versions/3.8.0/lib/python3.8/site-packages/nltk/probability.py in __init__(self, freqdist, bins, discount)
   1737         self._trigrams_contain = defaultdict(float)
   1738         self._wordtypes_before = defaultdict(float)
-> 1739         for w0, w1, w2 in freqdist:
   1740             self._bigrams[(w0, w1)] += freqdist[(w0, w1, w2)]
   1741             self._wordtypes_after[(w0, w1)] += 1

ValueError: too many values to unpack (expected 3)

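The nltk.lm module does not have this restriction. Its KneserNeyInterpolated model takes the n-gram order as an argument, so the same code works for bigrams (n = 2) as well as for 4-grams:

from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

tokens = "What a piece of work is man! how noble in reason! how infinite in faculty! in \
    form and moving how express and admirable! in action how like an angel! in apprehension how like a god! \
    the beauty of the world, the paragon of animals!".split()

n = 4  # Order of the n-gram model; use 2 for a bigram model
# padded_everygram_pipeline expects a list of sentences, each a list of tokens.
# Passing the flat token list would treat every word as a sentence of characters.
train_data, padded_sents = padded_everygram_pipeline(n, [tokens])

model = KneserNeyInterpolated(n)
model.fit(train_data, padded_sents)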
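
For the bigram case in the question, the interpolated Kneser-Ney estimate is also small enough to write by hand, which may help when checking what a library returns. The sketch below is an illustration under stated assumptions, not NLTK's implementation: `train_bigram_kneser_ney` is a hypothetical helper, the discount of 0.75 is an arbitrary choice, there is no sentence padding, and an unseen context simply falls back to the continuation probability.

```python
from collections import Counter, defaultdict


def train_bigram_kneser_ney(tokens, discount=0.75):
    """Return prob(word, prev): interpolated Kneser-Ney for bigrams."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter()       # how often prev starts a bigram
    followers = defaultdict(set)     # distinct words seen after prev
    predecessors = defaultdict(set)  # distinct words seen before word
    for (prev, word), count in bigram_counts.items():
        context_counts[prev] += count
        followers[prev].add(word)
        predecessors[word].add(prev)
    bigram_types = len(bigram_counts)

    def prob(word, prev):
        # Continuation probability: share of bigram types that end in word.
        p_cont = len(predecessors.get(word, ())) / bigram_types
        total = context_counts.get(prev, 0)
        if total == 0:
            return p_cont  # unseen context: assumption, use p_cont alone
        # Discounted bigram estimate plus the mass freed by discounting.
        discounted = max(bigram_counts.get((prev, word), 0) - discount, 0.0)
        backoff_weight = discount * len(followers[prev]) / total
        return discounted / total + backoff_weight * p_cont

    return prob


tokens = (
    "What a piece of work is man! how noble in reason! how infinite in "
    "faculty! in form and moving how express and admirable! in action how "
    "like an angel! in apprehension how like a god! the beauty of the "
    "world, the paragon of animals!"
).split()

prob = train_bigram_kneser_ney(tokens)
print(prob("like", "how"))   # seen bigram
print(prob("noble", "in"))   # unseen bigram with a seen context
print(sum(prob(w, "how") for w in set(tokens)))  # should be close to 1
```

For a context that occurs in the training text, the probabilities over the vocabulary sum to one, because the mass removed by the discount is redistributed in proportion to how many distinct words precede each candidate.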