Python NLTK perplexity measure inversion

Tags: python, machine-learning, nltk

I have been given a training text and a test text. What I want to do is train a language model on the training data and then compute the perplexity of the test data.

Here is my code:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends

from nltk import word_tokenize, sent_tokenize 

fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len = n , max_len=n);

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)


model = Laplace(n) 
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest)) 
When I run this code with n=1, i.e. unigrams, I get a perplexity of 1068.332393940235. With n=2, i.e. bigrams, I get 1644.3441077259993, and with trigrams I get 2552.2085752565313.


What is wrong with it?

The way you are creating the test data is wrong: the training data is lower-cased, but the test data is not converted to lowercase, and the start and end tokens are missing from the test data. Try this:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize 

"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

# Lower-case and tokenize the test text the same way, then build padded everygrams.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(textTest)]
test_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(1) 
model.fit(train_data, padded_sents)

# perplexity() takes a single generator of ngrams, so score each padded test
# sentence separately and average the results at the end.
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p

print ("Perplexity: {0}".format(s/(i+1)))

Thanks, but I think you are missing a point: you should pass n into the Laplace function, right? And can you explain why you summed up all the perplexities? Is that correct?

The value passed to Laplace is the smoothing parameter, usually greater than 0; it has nothing to do with unigrams, bigrams or ngrams. We compute the perplexity of each test sentence, because the perplexity method only accepts a single generator rather than a list of generators, and then average them at the end, as you can see in the print statement. So perplexity = s / (i + 1).

Okay, so if I want to use MLE, say, then I do have to pass n, right?

For MLE, yes, the highest ngram order, which is n in your case, has to be passed.
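
For reference, a minimal sketch of the MLE variant mentioned in the comments, reusing the toy training string from the answer above; the score() call and its expected value are my own illustration. MLE takes the highest ngram order n and applies no smoothing.

from nltk import word_tokenize, sent_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 2
text = "This is an orange. This is a mango"

# Lower-case, tokenize and build padded everygrams exactly as in the answer.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = MLE(n)  # for MLE the highest ngram order n is passed explicitly
model.fit(train_data, padded_sents)

# In this toy corpus "this" is always followed by "is", so the MLE estimate is 1.0.
print(model.score("is", ["this"]))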