Unigram tagging in NLTK


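Using NLTK's UnigramTagger, I am training on sentences from the Brown corpus.

I tried different categories and got roughly the same value each time. The value is around 0.9328... for every category, for example fiction, romance, or humor.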

import nltk
from nltk.corpus import brown

# Fiction
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
# 0.9415956079897209

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
# 0.9348490474422324

Why is this the case? Is it because the categories come from the same corpus? Or is it because they use the same part-of-speech tags?

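It looks like you are training the UnigramTagger and then evaluating it on the same data it was trained on. Take a look at the NLTK tagging documentation, specifically the part about evaluation.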
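
To make the effect concrete, here is a minimal toy sketch of a unigram tagger in plain Python. It is not NLTK's implementation: the helper names `train_unigram` and `accuracy` and the tiny hand-made sentences are assumptions for illustration only. Like NLTK's UnigramTagger without a backoff tagger, it assigns None to words it never saw in training.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def accuracy(model, tagged_sents):
    """Fraction of tokens whose predicted tag equals the gold tag.

    Words missing from the model are predicted as None, which never matches.
    """
    total = correct = 0
    for sent in tagged_sents:
        for word, tag in sent:
            total += 1
            correct += model.get(word) == tag
    return correct / total if total else 0.0

# Hypothetical toy data, not taken from the Brown corpus.
train = [
    [("the", "AT"), ("dog", "NN"), ("runs", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
held_out = [
    [("the", "AT"), ("bird", "NN"), ("runs", "VBZ")],
]

model = train_unigram(train)
seen_score = accuracy(model, train)       # evaluated on the training data
unseen_score = accuracy(model, held_out)  # evaluated on held-out data
print(seen_score, unseen_score)
```

On the training sentences every word has already been memorized, so the score is perfect. On the held-out sentence the unseen word "bird" is missed and the score drops. The same mechanism, on a much larger scale, explains the uniformly high numbers in the question.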

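With your code you get a high score, which is to be expected because your training data and your evaluation/test data are identical. If you change the test data so that it differs from the training data, you will get different results. My examples are below: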

Category: Fiction

Here I used brown.tagged_sents(categories='fiction')[:500] as the training set and brown.tagged_sents(categories='fiction')[501:600] as the test/evaluation set.

from nltk.corpus import brown
import nltk

# Fiction    
brown_tagged_sents = brown.tagged_sents(categories='fiction')[:500]
brown_sents = brown.sents(categories='fiction') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='fiction')[501:600])

This gives you a score of ~0.74746106967359513

Category: Romance

Here I used brown.tagged_sents(categories='romance')[:500] as the training set and brown.tagged_sents(categories='romance')[501:600] as the test/evaluation set.

from nltk.corpus import brown
import nltk

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')[:500]
brown_sents = brown.sents(categories='romance') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='romance')[501:600])

This gives you a score of ~0.7046799354491662
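
The fixed slices above work, but they leave out sentence 500 and use only part of each category. As a possible alternative, the following sketch splits any sequence of tagged sentences by fraction. The function name `split_train_test` and the `train_fraction` parameter are hypothetical, and the toy sentences stand in for real corpus data.

```python
def split_train_test(tagged_sents, train_fraction=0.9):
    """Split sentences into a leading training part and a trailing test part.

    The two parts are disjoint and together cover every sentence.
    """
    sents = list(tagged_sents)
    cut = int(len(sents) * train_fraction)
    return sents[:cut], sents[cut:]

# Hypothetical toy input: ten one-word sentences.
toy = [[("w%d" % i, "NN")] for i in range(10)]
train_sents, test_sents = split_train_test(toy, train_fraction=0.8)
print(len(train_sents), len(test_sents))
```

With the real corpus you would pass brown.tagged_sents(categories='fiction') instead of the toy list, train nltk.UnigramTagger on the first part, and evaluate on the second. The exact score will depend on the split you choose.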


I hope this helps answer your question.

Could you post your code so that we can try to reproduce this?

@RahulP I have updated the question with the code.