Python: correct implementation of "third-order" Kneser-Ney smoothing (for trigram models)

Tags: python, nlp, smoothing

In the code below I am trying to compute the probability of a trigram using Kneser-Ney smoothing with a fixed discount. I have gone through the main papers describing Kneser-Ney, and this [question]() on Stack Exchange is a good summary of the bigram case.

I find it hard to implement Kneser-Ney for the trigram case from its mathematical formulation, because the formulas are fairly involved and hard to follow. After a long search I could not find a code-level explanation of the method.

I assume a closed vocabulary, and I want to check whether this code is a correct implementation of Kneser-Ney.

Specifically, the function score_trigram(self, tri_g) takes a trigram as a tuple ('u','v','w') and tries to compute the log of its probability. The dicts shown in the __init__ method store the frequencies of unigrams, bigrams and trigrams learned from some corpus.

Just assume that these frequency counts are correctly initialized and given.

If we have a trigram (a,b,c), then for a trigram with a non-zero count the high-level Kneser-Ney formula is:

p((a,b,c)) = p_ML_discounted((a,b,c)) + total_discount_1 * p_KN((b,c))

p_ML_discounted((a,b,c)) = (count((a,b,c)) - discount) / count((a,b))

total_discount_1 = discount * follow_up_count((a,b)) / count((a,b))

p_KN((b,c)) = continuation_count((b,c)) / count_unique_trigrams + total_discount_2 * p_KN(c)

total_discount_2 = discount + follow_up_count(b) / count_unique_bigrams

p_KN(c) = (continuation_count(c) - discount) / count_unique_bigrams + discount * 1/vocabulary_size

I have two questions:
1- Is the preceding formulation correct for Kneser-Ney on trigrams?

2- Does the corresponding scoring function in the code implement it correctly?

import collections
import math

class CustomLanguageModel:

def __init__(self, corpus):
    """Initialize your data structures in the constructor."""
    ### n-gram counts
    # trigram dict entry > ('word_a','word_b','word_c') : 10
    self.trigramCounts = collections.defaultdict(lambda: 0)

    # bigram dict entry > ('word_a','word_b') : 11
    self.bigramCounts = collections.defaultdict(lambda: 0)

    # unigram dict entry > 'word_a' : 15
    self.unigramCounts = collections.defaultdict(lambda: 0)

    ### Kneser-Ney (KN) counts

    '''The follow-up count of a bigram (a,b) is the number of unique trigrams
    that start with (a,b). For example, if the frequency of the trigram (a,b,c) is 3,
    this increments the follow-up count of (a,b) by one; likewise, if the frequency
    of (a,b,d) is 5, this adds one to the follow-up count of (a,b).'''
    # dict entry as >  ('word_a','word_b') : 7
    self.bigram_follow_up_dict = collections.defaultdict(lambda: 0)

    '''The continuation count of a bigram (y,z) is the number of unique trigrams
    that end with (y,z). For example, if the frequency of the trigram (x,y,z) is 3,
    this increments the continuation count of (y,z) by one; likewise, if the frequency
    of (r,y,z) is 5, this adds one to the continuation count of (y,z).'''
    # dict entry as > ('word_a','word_b') : 5
    self.bigram_continuation_dict = collections.defaultdict(lambda: 0)

    '''The continuation count of a unigram 'z' is the number of unique bigrams that end
    with 'z'. For example, if the frequency of the bigram ('y','z') is 3, this increments
    the continuation count of 'z' by one. Likewise, if the frequency of ('w','z') is 5,
    this adds one to the continuation count of 'z'.
    '''
    # dict entry as >  'word_z' : 5
    self.unigram_continuation_count = collections.defaultdict(lambda: 0)

    '''The follow-up count of a unigram 'a' is the number of unique bigrams that start
    with 'a'. For example, if the frequency of the bigram ('a','b') is 3, this increments
    the follow-up count of 'a' by one. Likewise, if the frequency of ('a','c') is 5,
    this adds one to the follow-up count of 'a'. '''
    # dict entry as >  'word_a' : 5
    self.unigram_follow_up_count = collections.defaultdict(lambda: 0)

    # total number of words, fixed discount
    self.total = 0
    self.d = 0.75
    self.train(corpus)

def train(self, corpus):
    # count and initialize the dictionaries
    pass
def score_trigram(self,tri_g): 

    score = 0.0
    w1, w2, w3 = tri_g
    # use the trigram if it has a frequency > 0
    if self.trigramCounts[(w1,w2,w3)] > 0 and self.bigramCounts[(w1,w2)] > 0 :
        score += self.top_level_trigram_prob(*tri_g)
    # otherwise use the bigram (w2,w3) as an approximation
    else :
        if self.bigramCounts[(w2,w3)] > 0  and self.unigramCounts[w2]> 0:
            score = score + self.top_level_bigram_prob(w2,w3)
        ## otherwise use the unigram w3 as an approximation
        else:
            score += math.log(self.pkn_unigram(w3))               
    return score

def top_level_trigram_prob(self,w1,w2,w3):
    score=0.0
    term1 = max(self.trigramCounts[(w1,w2,w3)]-self.d,0)/self.bigramCounts[(w1,w2)]
    alfa = self.d * self.bigram_follow_up_dict[(w1,w2)] / len(self.bigram_follow_up_dict)
    term2 = self.pkn_bigram(w2,w3)
    score += math.log(term1+ alfa* term2)
    return score  

def top_level_bigram_prob(self,w1,w2):
    score=0.0
    term1 = max(self.bigramCounts[(w1,w2)]-self.d,0)/self.unigramCounts[w1]
    alfa = self.d * self.unigram_follow_up_count[w1] / self.unigramCounts[w1]
    term2 = self.pkn_unigram (w2)
    score += math.log(term1+ alfa* term2)
    return score 

def pkn_bigram(self,w1,w2):           
    return self.pkn_bigram_contuation(w1,w2) + self.pkn_bigram_follow_up(w1) * self.pkn_unigram(w2)


def pkn_bigram_contuation (self,w1,w2):
    ckn= self.bigram_continuation_dict[(w1,w2)]
    term1 = (max(ckn -self.d,0)/len(self.bigram_continuation_dict))        
    return term1

def pkn_bigram_follow_up (self,w1):
    ckn = self.unigram_follow_up_count[w1]
    alfa = self.d * ckn / len(self.bigramCounts)
    return alfa  

def pkn_unigram (self,w1):
    #continuation of w1 + lambda uniform
    ckn = self.unigram_continuation_count[w1]
    p_cont = float(max(ckn - self.d, 0)) / len(self.bigramCounts) + 1.0 / len(self.unigramCounts)
    return p_cont
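
For reference, train() is left as a stub above; a minimal sketch of how it could fill in these counts, assuming corpus is an iterable of tokenized sentences (this is a sketch, not the original author's implementation), is:

def train(self, corpus):
    """Count n-grams and the Kneser-Ney follow-up/continuation statistics."""
    for sentence in corpus:                         # assume each sentence is a list of tokens
        self.total += len(sentence)
        for i, word in enumerate(sentence):
            self.unigramCounts[word] += 1
            if i >= 1:
                self.bigramCounts[(sentence[i-1], word)] += 1
            if i >= 2:
                self.trigramCounts[(sentence[i-2], sentence[i-1], word)] += 1
    # derive the type counts from the raw frequency counts
    for (a, b) in self.bigramCounts:
        self.unigram_follow_up_count[a] += 1        # unique bigrams starting with a
        self.unigram_continuation_count[b] += 1     # unique bigrams ending with b
    for (a, b, c) in self.trigramCounts:
        self.bigram_follow_up_dict[(a, b)] += 1     # unique trigrams starting with (a,b)
        self.bigram_continuation_dict[(b, c)] += 1  # unique trigrams ending with (b,c)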

Let me answer your first question.

Below I have numbered your equations (I corrected the typo in (5) based on your code, and added the max(,0) clipping in (2) and (6)):

(1) p((a,b,c)) = p_ML_discounted((a,b,c)) + total_discount_1 * p_KN((b,c))

(2) p_ML_discounted((a,b,c)) = max(count((a,b,c)) - discount, 0) / count((a,b))

(3) total_discount_1 = discount * follow_up_count((a,b)) / count((a,b))

(4) p_KN((b,c)) = continuation_count((b,c)) / count_unique_trigrams + total_discount_2 * p_KN(c)

(5) total_discount_2 = discount * follow_up_count(b) / count_unique_bigrams

(6) p_KN(c) = max(continuation_count(c) - discount, 0) / count_unique_bigrams + discount * 1/vocabulary_size

Regarding the correctness of the equations above:

(1)~(3): correct.

(4)(5): incorrect. In both equations, the denominator should be the count of unique trigrams whose second word is b, i.e. the number of distinct trigrams of the form (*, b, *).

I see in your code that pkn_bigram_contuation() does discount the continuation count of (b,c), which is correct, but this is not reflected in your equation (4); a sketch of the corrected count is given below.
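
As a concrete sketch of that correction (count_unique_trigrams_with_middle is a hypothetical helper, and it assumes trigramCounts holds only observed trigrams), pkn_bigram_contuation could become:

def count_unique_trigrams_with_middle(self, b):
    # hypothetical helper: number of distinct trigram types of the form (*, b, *)
    return sum(1 for (x, y, z) in self.trigramCounts if y == b)

def pkn_bigram_contuation(self, w1, w2):
    # discounted continuation count of (w1, w2), normalized by the number of
    # unique trigrams whose middle word is w1, not by all unique trigrams
    ckn = self.bigram_continuation_dict[(w1, w2)]
    denom = self.count_unique_trigrams_with_middle(w1)
    return max(ckn - self.d, 0) / float(denom) if denom else 0.0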

(6) I think you are using the implementation of equation (4.37) from the reference. The problem is that the authors do not make clear how \lambda(\epsilon) should be computed so that the unigram probabilities normalize.

In fact, the unigram probability does not need to be discounted at all (see the slide titled "Kneser-Ney details" on page 5), so (6) can simply be


p_KN(c) = continuation_count(c) / count_unique_bigrams
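
In code, a sketch of this simplified pkn_unigram (reusing the question's dictionaries and assuming bigramCounts holds only observed bigrams) would be:

def pkn_unigram(self, w1):
    # undiscounted continuation probability of w1:
    # (number of unique bigrams ending in w1) / (number of unique bigram types)
    return self.unigram_continuation_count[w1] / float(len(self.bigramCounts))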

Before you start implementing, please show us exactly which equations you are implementing.

This is my first question on Stack Overflow. Are the equations stated clearly enough?

Could you use a math environment to make the equations more readable?

As far as I know, the math environment is not enabled on Stack Overflow; it is only available on Mathematics Stack Exchange and the like, isn't it?

On what basis do you say that the count of unique trigrams should be replaced by "the count of unique trigrams whose second word is b"?