Unigram tagging in NLTK


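Using the NLTK UnigramTagger, I am training on sentences from the Brown corpus. I tried different categories and got roughly the same value each time: around 0.9328... for every category I tried, such as fiction, romance, or humor.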

import nltk
from nltk.corpus import brown

# Fiction
brown_tagged_sents = brown.tagged_sents(categories='fiction')
brown_sents = brown.sents(categories='fiction')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
# Output: 0.9415956079897209

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')
brown_sents = brown.sents(categories='romance')
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown_tagged_sents)
# Output: 0.9348490474422324

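Why is this the case? Is it because the categories come from the same corpus? Or is it because they share the same part-of-speech tagging?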
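It looks like you are training the UnigramTagger and then evaluating it on the same data it was trained on. Take a look at the NLTK tagging documentation, in particular the part about evaluation.
With your code you will naturally get a high score, because your training data and your evaluation/test data are identical. If you evaluate on test data that differs from the training data, you will get different results. My examples, one per category, follow.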
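Before the per-category runs, a toy illustration may help show why scoring on the training data inflates the result. The sketch below is a simplified, pure-Python stand-in for a unigram tagger, not NLTK's implementation: `ToyUnigramTagger`, its `accuracy` method, and the tiny hand-made sentences are all hypothetical names and data chosen for illustration. It assumes the usual unigram idea of tagging each word with the tag it carried most often in training, and tagging unseen words as `None` when no backoff tagger is supplied.

```python
from collections import Counter, defaultdict


class ToyUnigramTagger:
    """Hypothetical stand-in: tag each word with its most frequent training tag."""

    def __init__(self, tagged_sents):
        counts = defaultdict(Counter)
        for sent in tagged_sents:
            for word, tag in sent:
                counts[word][tag] += 1
        # Keep only the single most frequent tag per word.
        self.best_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

    def tag(self, words):
        # Words never seen in training get None (no backoff in this sketch).
        return [(w, self.best_tag.get(w)) for w in words]

    def accuracy(self, gold_sents):
        # Fraction of tokens whose predicted tag matches the gold tag.
        correct = total = 0
        for sent in gold_sents:
            words = [w for w, _ in sent]
            for (_, predicted), (_, gold) in zip(self.tag(words), sent):
                correct += predicted == gold
                total += 1
        return correct / total if total else 0.0


# Made-up sentences using Brown-style tags, for illustration only.
train = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
held_out = [
    [("the", "AT"), ("bird", "NN"), ("sings", "VBZ")],
]

tagger = ToyUnigramTagger(train)
train_score = tagger.accuracy(train)        # every token was memorized
held_out_score = tagger.accuracy(held_out)  # only "the" was seen in training
print(train_score, held_out_score)
```

On this toy data the tagger is perfect on its own training sentences and gets only one of three held-out tokens right. On real text the training-set score stays a little below 1.0, because a word that occurs with several tags is always given its most frequent one, which would explain why every category lands at a similar high value when scored on its own training data.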
Category: Fiction
Here the training set is brown.tagged_sents(categories='fiction')[:500] and the test/evaluation set is brown.tagged_sents(categories='fiction')[501:600].

from nltk.corpus import brown
import nltk

# Fiction    
brown_tagged_sents = brown.tagged_sents(categories='fiction')[:500]
brown_sents = brown.sents(categories='fiction') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='fiction')[501:600])

This gives you a score of about 0.74746106967359513.
Category: Romance
Here the training set is brown.tagged_sents(categories='romance')[:500] and the test/evaluation set is brown.tagged_sents(categories='romance')[501:600].

from nltk.corpus import brown
import nltk

# Romance
brown_tagged_sents = brown.tagged_sents(categories='romance')[:500]
brown_sents = brown.sents(categories='romance') # not sure what this line is doing here
unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)
unigram_tagger.evaluate(brown.tagged_sents(categories='romance')[501:600])

This gives you a score of about 0.7046799354491662.
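The two runs differ only in the category name, so they can be folded into one helper that reports the training-set score next to the held-out score. This is a minimal sketch, not part of NLTK: `held_out_scores`, `brown_demo`, and the `make_tagger` parameter are hypothetical names. It assumes the Brown corpus is already installed (for example via `nltk.download('brown')`), and it uses a contiguous split (the first 500 sentences for training and the next 100 for testing), so its numbers may differ slightly from the `[501:600]` slices used in the examples.

```python
def held_out_scores(tagged_sents, n_train, n_test, make_tagger):
    """Train on the first n_train sentences, score on train and on the next n_test."""
    train = tagged_sents[:n_train]
    test = tagged_sents[n_train:n_train + n_test]
    tagger = make_tagger(train)
    # Recent NLTK releases name the scoring method accuracy() and deprecate
    # evaluate(); fall back to evaluate() when accuracy() is not available.
    score = getattr(tagger, "accuracy", None) or tagger.evaluate
    return score(train), score(test)


def brown_demo(category="fiction"):
    """Hypothetical convenience wrapper around the Brown corpus."""
    # Imported here so the helper above can be used without NLTK installed.
    import nltk
    from nltk.corpus import brown

    sents = brown.tagged_sents(categories=category)
    return held_out_scores(sents, 500, 100, nltk.UnigramTagger)
```

Calling `brown_demo('fiction')` and `brown_demo('romance')` returns a `(train_score, held_out_score)` pair for each category, which makes the gap between the two kinds of evaluation visible in a single call.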
I hope this helps answer your question.
Can you post your code so that we can try to reproduce this?
@RahulP I updated the question with the code.