Python NLTK perplexity measure inversion

Tags: python, machine-learning, nltk

I have been given a training text and a test text. What I want to do is train a language model on the training data and then compute the perplexity of the test data.

Here is my code:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends

from nltk import word_tokenize, sent_tokenize 

fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace

n = 1
padded_bigrams = list(pad_both_ends(word_tokenize(textTest), n=1))
trainTest = everygrams(padded_bigrams, min_len = n , max_len=n);

train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)


model = Laplace(n) 
model.fit(train_data, padded_sents)

print(model.perplexity(trainTest)) 
When I run this code with n=1, i.e. unigrams, I get a perplexity of 1068.332393940235. With n=2, i.e. bigrams, I get 1644.3441077259993, and with trigrams I get 2552.2085752565313.


What is wrong with it?

The way you are creating the test data is wrong: the training data is lower-cased, but the test data is not converted to lowercase, and the start and end tokens are missing from the test data. Try this:

import os
import requests
import io #codecs
from nltk.util import everygrams
from nltk.lm.preprocessing import pad_both_ends
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import Laplace
from nltk import word_tokenize, sent_tokenize 

"""
fileTest = open("AaronPressman.txt","r");
with io.open('AaronPressman.txt', encoding='utf8') as fin:
        textTest = fin.read()
if os.path.isfile('AaronPressmanEdited.txt'):
    with io.open('AaronPressmanEdited.txt', encoding='utf8') as fin:
        text = fin.read()
"""
textTest = "This is an ant. This is a cat"
text = "This is an orange. This is a mango"

n = 2
# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(text)]
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

# Lower-case and tokenize the test text the same way, then build padded everygrams.
tokenized_text = [list(map(str.lower, word_tokenize(sent))) 
                for sent in sent_tokenize(textTest)]
test_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = Laplace(1) 
model.fit(train_data, padded_sents)

# perplexity() takes a single generator of ngrams, so score each padded test
# sentence separately and average the results at the end.
s = 0
for i, test in enumerate(test_data):
    p = model.perplexity(test)
    s += p

print ("Perplexity: {0}".format(s/(i+1)))

Thanks, but I think you are missing a point: you should pass n into the Laplace function, right? And can you explain why you summed up all the perplexities? Is that correct?

The value passed to Laplace is the smoothing parameter, usually greater than 0; it has nothing to do with unigrams, bigrams or ngrams. We compute the perplexity of each test sentence, because the perplexity method only accepts a single generator rather than a list of generators, and then average them at the end, as you can see in the print statement. So perplexity = s / (i + 1).

Okay, so if I want to use MLE, say, then I do have to pass n, right?

For MLE, yes, the highest ngram order, which is n in your case, has to be passed.
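
For reference, a minimal sketch of the MLE variant mentioned in the comments, reusing the toy training string from the answer above; the score() call and its expected value are my own illustration. MLE takes the highest ngram order n and applies no smoothing.

from nltk import word_tokenize, sent_tokenize
from nltk.lm import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

n = 2
text = "This is an orange. This is a mango"

# Lower-case, tokenize and build padded everygrams exactly as in the answer.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = MLE(n)  # for MLE the highest ngram order n is passed explicitly
model.fit(train_data, padded_sents)

# In this toy corpus "this" is always followed by "is", so the MLE estimate is 1.0.
print(model.score("is", ["this"]))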