Python: training the NLTK Brill tagger with a txt file as input


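Hi everyone. I am currently working on my final-year project, titled "Malay part-of-speech tagger using the Brill tagger".

I would like to ask how to train on tagged sentences that I have saved in a txt file. The input should be a txt file, which is then used to train the Brill tagger. After that, I will use another txt file as test data. However, I am stuck at the training step. Can you help me?

Here is some of my code: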

import nltk  
f = open('gayahidupsihat_tagged.txt')  
malay_tagged = f.read()   

def train_brill_tagger(train_data):
    # Modules for creating the templates.
    from nltk.tag import UnigramTagger
    from nltk.tag.brill import SymmetricProximateTokensTemplate, ProximateTokensTemplate
    from nltk.tag.brill import ProximateTagsRule, ProximateWordsRule
    # The brill tagger module in NLTK.
    from nltk.tag.brill import FastBrillTaggerTrainer
    unigram_tagger = UnigramTagger(train_data)
    templates = [SymmetricProximateTokensTemplate(ProximateTagsRule, (1,1)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (2,2)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (1,2)),
                 SymmetricProximateTokensTemplate(ProximateTagsRule, (1,3)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,1)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (2,2)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,2)),
                 SymmetricProximateTokensTemplate(ProximateWordsRule, (1,3)),
                 ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1,1)),
                 ProximateTokensTemplate(ProximateWordsRule, (-1, -1), (1,1))]

    trainer = FastBrillTaggerTrainer(initial_tagger=unigram_tagger,
                                   templates=templates, trace=3,
                                   deterministic=True)
    brill_tagger = trainer.train(train_data, max_rules=10)
    print
    return brill_tagger    

malay_train = (malay_tagged[:10]) 
malay_test = (malay_tagged[10:15]) 
malay20 = malay_tagged[20]

mt = train_brill_tagger(malay_train)    
print mt.tag(malay20)

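What I actually want is to train on one tagged paragraph and then test on another paragraph. After that, I will use a tagged sentence to evaluate the resulting Brill tagger.

For example:

I train on this (gayahidupsihat_train.txt), which in the actual input is all on one line: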

Gaya\NN hidup\NN sihat\VB boleh\MD lah\UH ditakrifkan\VBZ sebagai\DT
satu\CD amalan\VBZ kehidupan\NN yang\DT membawa\VBZ impak\NN positif\NN
kepada\TO diri\NN seseorang\NN ,\, keluarganya\NN dan\CC masyarakat\NN.
Antara\IN contoh\NN kehidupan\NN yang\DT sihat\VB ialah\DT individu\NN
tersebut\EX hidup\VB dengan\DT penuh\RB ceria\RB tanpa\NN mengalami\VBZ
sebarang\NN masalah\NN yang\DT boleh\MD menjejaskan\VBZ kehidupannya\NN
untuk\TO satu\CD tempoh\NN tertentu\EX pula\DT .\. Sudah\EX pasti\RB
dalam\DT kehidupan\NN era\NN moden\NN yang\DT begitu\DT banyak\RB
tekanan\VB ini\DT gaya\NN hidup\NN sihat\VB menjadi\VBZ satu\NUM
matlamat\NN yang\DT perlu\MD dicapai\VBZ segera\VB. Oleh\PDT itu\DT ,\,
terdapat\EX pelbagai\NN tindakan\VBZ yang\DT boleh\MD dilakukan\VBZ
untuk\TO mencapai\VBZ matlamat\NN ini\DT .\.

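Then I want to test with this (gayahidupsihat_test.txt):

After that, I will use some tagged words to try out the tagger and evaluate it.
The English version shows output like the following: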
Training Brill tagger on 500 sentences...
Finding initial useful rules...
Found 10210 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  46  46   0   0  | TO -> IN if the tag of the following word is 'AT'
  18  20   2   0  | TO -> IN if the tag of words i+1...i+3 is 'CD'
  14  14   0   0  | IN -> IN-TL if the tag of the preceding word is
                  |   'NN-TL', and the tag of the following word is
                  |   'NN-TL'
  11  11   0   1  | TO -> IN if the tag of the following word is 'NNS'
  10  10   0   0  | TO -> IN if the tag of the following word is 'JJ'
   8   8   0   0  | , -> ,-HL if the tag of the preceding word is 'NP-
                  |   HL'
   7   7   0   1  | NN -> VB if the tag of the preceding word is 'MD'
   7  13   6   0  | NN -> VB if the tag of the preceding word is 'TO'
   7   7   0   0  | NP-TL -> NP if the tag of words i+1...i+2 is 'NNS'
   7   7   0   0  | VBN -> VBD if the tag of the preceding word is
                  |   'NP'

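You need to parse your input files (training and test) into the format that the NLTK toolchain expects: a file is a list (or sequence) of sentences, a sentence is a list of tagged words, and a tagged word is a tuple of two strings, (word, tag). In your code, malay_tagged is a plain string (that is, a sequence of characters), so a slice such as malay_tagged[:10] gives you the first ten characters, not ten sentences.

Doing the parsing yourself is not hard, but NLTK's nltk.corpus.reader.TaggedCorpusReader can parse the files for you. Just make sure to tell it that the word/tag separator in your files is a backslash ("\\").

Oh, the code doesn't seem to be formatted.

Yes. Thanks, sundar nataraj. I'm new here. :)

What does the input look like? Ten sentences (if that is what we are looking at) is a very small training corpus.

Oh, I just wanted to build a very simple system first, because I have tried many times and it still won't run.

Gaya\NN hidup\NN sihat\VB boleh\MD lah\UH ditakrifkan\VBZ sebagai\DT satu\CD amalan\VBZ kehidupan\NN yang\DT membawa\VBZ impak\NN positif\NN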
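The "do it yourself" route mentioned in the answer could look roughly like the sketch below. It is plain Python with no NLTK dependency. The helper names parse_tagged_line and read_tagged_file are made up for illustration, and the sketch assumes one sentence per line, with tokens separated by whitespace and written as word\TAG.

```python
def parse_tagged_line(line, sep="\\"):
    """Turn one line of 'word<sep>TAG' tokens into a list of (word, tag) tuples."""
    sentence = []
    for token in line.split():
        # Split on the LAST separator so a word may itself contain the separator.
        word, found, tag = token.rpartition(sep)
        if not found:
            # Token carries no tag at all: keep the word and use None as the tag.
            word, tag = token, None
        sentence.append((word, tag))
    return sentence


def read_tagged_file(path, sep="\\", encoding="utf-8"):
    """Read a file with one tagged sentence per line; blank lines are skipped."""
    with open(path, encoding=encoding) as handle:
        return [parse_tagged_line(line, sep) for line in handle if line.strip()]


# A fragment of the training text from the question.
sample = r"Gaya\NN hidup\NN sihat\VB boleh\MD"
print(parse_tagged_line(sample))
# prints [('Gaya', 'NN'), ('hidup', 'NN'), ('sihat', 'VB'), ('boleh', 'MD')]
```

Note that the training file in the question is a single line, so read_tagged_file would return it as one long sentence. To get several sentences, the file would first have to be split so that each sentence sits on its own line. The resulting list of lists of tuples is the structure that taggers such as UnigramTagger take as training data.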
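For readers on NLTK 3, here is a hedged sketch of the whole pipeline using TaggedCorpusReader as the answer suggests. As far as I know, FastBrillTaggerTrainer and the Proximate* template classes from the question were removed in NLTK 3, so this sketch swaps in BrillTaggerTrainer with the ready-made brill24() template set. The DefaultTagger("NN") backoff, the file name train.txt, the helper names, and the hand-rolled accuracy function are my own assumptions and not part of the original post. The two training sentences are copied from the question and written one per line.

```python
import os
import tempfile

from nltk.corpus.reader import TaggedCorpusReader
from nltk.tag import DefaultTagger, UnigramTagger, untag
from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# Two sentences from the question's training text, one sentence per line.
TRAIN_TEXT = (
    r"Gaya\NN hidup\NN sihat\VB boleh\MD lah\UH ditakrifkan\VBZ sebagai\DT "
    r"satu\CD amalan\VBZ kehidupan\NN yang\DT membawa\VBZ impak\NN positif\NN"
    "\n"
    r"Oleh\PDT itu\DT ,\, terdapat\EX pelbagai\NN tindakan\VBZ yang\DT "
    r"boleh\MD dilakukan\VBZ untuk\TO mencapai\VBZ matlamat\NN ini\DT .\."
    "\n"
)


def load_tagged_sents(root, fileid):
    """Parse a word\\TAG file into a list of sentences of (word, tag) tuples."""
    # sep="\\" tells the reader that a backslash separates word and tag.
    reader = TaggedCorpusReader(root, [fileid], sep="\\")
    return [list(sent) for sent in reader.tagged_sents()]


def train_brill_tagger(train_sents, max_rules=10):
    """Train a Brill tagger on top of a unigram tagger."""
    # Assumption: unseen words fall back to 'NN' instead of None.
    initial = UnigramTagger(train_sents, backoff=DefaultTagger("NN"))
    trainer = BrillTaggerTrainer(initial, brill24(), trace=0, deterministic=True)
    return trainer.train(train_sents, max_rules=max_rules)


def accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = total = 0
    for gold in gold_sents:
        predicted = tagger.tag(untag(gold))
        for (_, guess), (_, truth) in zip(predicted, gold):
            correct += guess == truth
            total += 1
    return correct / total if total else 0.0


with tempfile.TemporaryDirectory() as root:
    with open(os.path.join(root, "train.txt"), "w", encoding="utf-8") as f:
        f.write(TRAIN_TEXT)
    train_sents = load_tagged_sents(root, "train.txt")

tagger = train_brill_tagger(train_sents)
print(tagger.tag(["gaya", "hidup", "sihat"]))
print("accuracy on the training sentences:", accuracy(tagger, train_sents))
```

With real data you would point load_tagged_sents at the directory holding gayahidupsihat_train.txt and gayahidupsihat_test.txt, train on the first, and pass the sentences of the second to accuracy. On a corpus this small the Brill trainer is unlikely to learn any rules, because the unigram tagger already reproduces its own training data, so the tagger only becomes useful with considerably more tagged text.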