Python HMM loaded from pickle looks untrained

I am trying to serialize an nltk.tag.hmm.HiddenMarkovModelTagger to a pickle so that I can use it whenever needed without re-training it. However, after loading it from the .pkl file, my HMM looks untrained. My two questions are:

  • What am I doing wrong?
  • Is serializing an HMM a good idea when one has a large dataset?
  • The code is as follows:

    In [1]: import nltk
    
    In [2]: from nltk.probability import *
    
    In [3]: from nltk.util import unique_list
    
    In [4]: import json
    
    In [5]: with open('data.json') as data_file:
       ...:         corpus = json.load(data_file)
       ...:     
    
    In [6]: corpus = [[tuple(l) for l in sentence] for sentence in corpus]
    
    In [7]: tag_set = unique_list(tag for sent in corpus for (word,tag) in sent)
    
    In [8]: symbols = unique_list(word for sent in corpus for (word,tag) in sent)
    
    In [9]: trainer = nltk.tag.HiddenMarkovModelTrainer(tag_set, symbols)
    
    In [10]: train_corpus = corpus[:4]
    
    In [11]: test_corpus = [corpus[4]]
    
    In [12]: hmm = trainer.train_supervised(train_corpus, estimator=LaplaceProbDist)
    
    In [13]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
    100.00%
    
    As you can see, the HMM is trained. Now I pickle it:

    In [14]: import pickle
    
    In [16]: output = open('hmm.pkl', 'wb')
    
    In [17]: pickle.dump(hmm, output)
    
    In [18]: output.close()
    
    After a reset and a reload, the model looks dumber than a box of rocks:

    In [19]: %reset
    Once deleted, variables cannot be recovered. Proceed (y/[n])? y
    
    In [20]: import pickle
    
    In [21]: import json
    
    In [22]: with open('data.json') as data_file:
       ....:     corpus = json.load(data_file)
       ....:     
    
    In [23]: test_corpus = [corpus[4]]
    
    In [24]: pkl_file = open('hmm.pkl', 'rb')
    
    In [25]: hmm = pickle.load(pkl_file)
    
    In [26]: pkl_file.close()
    
    In [27]: type(hmm)
    Out[27]: nltk.tag.hmm.HiddenMarkovModelTagger
    
    In [28]: print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))
    0.00%
    
    1) After In [22], you need to add the following line (see the full reload sketch below):

    corpus = [[tuple(l) for l in sentence] for sentence in corpus]
    
    2) Retraining the model every time just to test it is time-consuming,
    so it is indeed better to pickle.dump your model and load it back when needed.
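
    The reason is that json.load() returns plain lists, while the tagger was trained on, and emits, (word, tag) tuples; since a Python list never compares equal to a tuple, every token counts as a mismatch and the evaluation reports 0.00%. Below is a minimal sketch of the corrected reload step, assuming data.json and hmm.pkl were produced as in the session above (the with blocks replace the manual open()/close() calls but behave the same way):

    import json
    import pickle

    # Reload the raw corpus; json.load() yields lists, not tuples.
    with open('data.json') as data_file:
        corpus = json.load(data_file)

    # Repeat the same conversion used before training, so every token
    # is a (word, tag) tuple again.
    corpus = [[tuple(l) for l in sentence] for sentence in corpus]
    test_corpus = [corpus[4]]

    # Load the pickled tagger.
    with open('hmm.pkl', 'rb') as pkl_file:
        hmm = pickle.load(pkl_file)

    # With the tuple conversion in place, the score should match the
    # pre-pickle result instead of 0.00%. (Newer NLTK releases expose
    # this method as accuracy() instead of evaluate().)
    print('%.2f%%' % (100 * hmm.evaluate(test_corpus)))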
