How to save Python NLTK alignment models for later use?

python io nlp

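In Python, I'm using NLTK to create word alignments between parallel texts. Aligning bitexts can be a time-consuming process, especially over large corpora. It would be nice to do the alignments in a batch one day and use those alignments later on.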

from nltk import IBMModel1 as ibm
biverses = [...]  # list of AlignedSent objects
model = ibm(biverses, 20)

with open(path + "eng-taq_model.txt", 'w') as f:
    f.write(model.train(biverses, 20))  # makes empty file

Once the model has been created, how can I (1) save it to disk and (2) reuse it later?

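The immediate answer is to pickle it.

But because IBMModel1 builds its probability tables with lambda functions, it cannot be pickled with the default pickle / cPickle.

So we'll use dill instead. First install dill, then dump the trained model with dill.dump in place of pickle.dump; the Python 3 session further down shows the exact calls.

To use the pickled model, do the following: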

>>> import dill as pickle
>>> from nltk.corpus import comtrans
>>> bitexts = comtrans.aligned_sents()[:100]
>>> with open('model1.pk', 'rb') as fin:
...     ibm = pickle.load(fin)
... 
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

If you try to pickle the IBMModel1 object, which holds lambda functions, with plain cPickle, you'll end up with this:

>>> import cPickle as pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('model1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.7/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle function objects

(Note: the code snippets above were run with NLTK version 3.0.0.)

In Python 3 with NLTK 3.0.0 you will face the same problem, again because of the lambda functions inside IBMModel1:

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> from nltk.corpus import comtrans
>>> from nltk.align import IBMModel1
>>> bitexts = comtrans.aligned_sents()[:100]
>>> ibm = IBMModel1(bitexts, 20)
>>> with open('mode1.pk', 'wb') as fout:
...     pickle.dump(ibm, fout)
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
_pickle.PicklingError: Can't pickle <function IBMModel1.train.<locals>.<lambda> at 0x7fa37cf9d620>: attribute lookup <lambda> on nltk.align.ibm1 failed'

>>> import dill
>>> with open('model1.pk', 'wb') as fout:
...     dill.dump(ibm, fout)
... 
>>> exit()

alvas@ubi:~$ python3
Python 3.4.0 (default, Apr 11 2014, 13:05:11) 
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import dill
>>> from nltk.corpus import comtrans
>>> with open('model1.pk', 'rb') as fin:
...     ibm = dill.load(fin)
... 
>>> bitexts = comtrans.aligned_sents()[:100]
>>> aligned_sent = ibm.aligned(bitexts[0])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'IBMModel1' object has no attribute 'aligned'
>>> aligned_sent = ibm.align(bitexts[0])
>>> aligned_sent.words
['Wiederaufnahme', 'der', 'Sitzungsperiode']

(Note: in Python 3, pickle already uses the C accelerator that Python 2 exposed as cPickle.)

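You discuss saving the aligner model, but your question seems to be more about saving the bitexts you have already aligned: "It would be nice to do the alignments in a batch one day and use those alignments later on." That is the question I'm going to answer.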

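In the NLTK environment, the best way to use a corpus-like resource is to access it through a corpus reader. NLTK does not come with corpus writers, but the format supported by NLTK's AlignedCorpusReader is very easy to generate (NLTK 3 version).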
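As a rough sketch of what generating that format can look like: as far as I can tell, the reader expects each sentence pair as three consecutive lines (source words, target words, then the alignment as space-separated i-j pairs). The helper name write_aligned_block and the sample sentence pair are assumptions made for illustration, and the example writes to an in-memory buffer so it runs without NLTK.

```python
import io


def write_aligned_block(out, words, mots, alignment):
    """Write one sentence pair as three lines: source, target, alignment."""
    out.write(" ".join(words) + "\n")
    out.write(" ".join(mots) + "\n")
    # Alignment pairs become "i-j" tokens, e.g. "0-0 1-1 1-2 2-3".
    out.write(" ".join("%d-%d" % (i, j) for i, j in sorted(alignment)) + "\n")


# Hypothetical stand-in data. With NLTK you would instead pass the words,
# mots and alignment of each sentence returned by model.align(pair).
pairs = [
    (["Wiederaufnahme", "der", "Sitzungsperiode"],
     ["Resumption", "of", "the", "session"],
     [(0, 0), (1, 1), (1, 2), (2, 3)]),
]

buffer = io.StringIO()  # swap for open(<a .txt file inside "folder">, "w")
for words, mots, alignment in pairs:
    write_aligned_block(buffer, words, mots, alignment)

text = buffer.getvalue()
print(text)
```

Writing to a real file inside the corpus folder, with a name that matches the reader's file pattern, should make the result loadable by the corpus reader.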

That's it. You can later reload and use your aligned sentences exactly as you would use the comtrans corpus:

from nltk.corpus.reader import AlignedCorpusReader

mycorpus = AlignedCorpusReader(r"folder", r".*\.txt")
biverses_reloaded = mycorpus.aligned_sents()

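As you can see, you don't need the aligner object itself. The aligned sentences can be loaded with a corpus reader, and the aligner itself is pretty useless unless you want to study the embedded probabilities.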

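Comment: I'm not sure I'd call the aligner object a "model". In NLTK 2, the aligner is not set up to align new text; it doesn't even have an align() method. In NLTK 3, align() can align new text, but only when used from Python 2; in Python 3 it is broken, apparently because of the stricter rules for comparing objects of different types. Still, if you want to be able to pickle and reload the aligner, I'll be happy to add that to my answer; from what I've seen, it can be done with vanilla cPickle.

You asked about the aligner model, and it looks like you can store its output as a list of AlignedSent objects: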

from nltk.align import IBMModel1 as IBM
from nltk.align import AlignedSent
import dill as pickle

biverses = [...]  # list of AlignedSent objects
model = IBM(biverses, 20)  # train for 20 iterations

# Store each sentence pair's alignment on the AlignedSent itself.
for sent in biverses:
    sent.alignment = model.align(sent).alignment

After that, you can save it as a pickle with dill:

with open('alignedtext.pk', 'wb') as arquive:
    pickle.dump(biverses, arquive)
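The snippet above stops at the dump; reading the file back later is the mirror image. The sketch below uses the standard pickle module and plain tuples as stand-ins for the aligned sentences (an assumption, chosen so it runs without NLTK or dill), and it writes to a temporary directory rather than the working directory. With the dill alias used above, the same load call should apply.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for the aligned sentences:
# (source words, target words, alignment pairs).
aligned = [
    (["Wiederaufnahme", "der", "Sitzungsperiode"],
     ["Resumption", "of", "the", "session"],
     [(0, 0), (1, 1), (1, 2), (2, 3)]),
]

path = os.path.join(tempfile.mkdtemp(), "alignedtext.pk")

# Day one: save the whole batch of alignments.
with open(path, "wb") as fout:
    pickle.dump(aligned, fout)

# Later: reload them without retraining anything.
with open(path, "rb") as fin:
    reloaded = pickle.load(fin)

words, mots, alignment = reloaded[0]
print(words, alignment)
```

Storing plain lists and tuples like this keeps the file readable even where NLTK is not installed, at the cost of rebuilding AlignedSent objects by hand if you need them.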

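I don't know what you tried, but I see no lambdas, and I had no problem pickling and unpickling the "model" with vanilla pickle.

@alexis, that's interesting. Do you get the same error as in the updated answer?

I haven't had a chance to try it yet, but I may have tested pickling with Python 2, which would explain the different experience (I hadn't realized the module had changed so much). I'll let you know when I try it.

I took another look with Python 3. The class constructor does not return a lambda function, and neither does train(). But the model is stored in defaultdicts that are defined with lambda (in the usual way), and a defaultdict that uses a lambda cannot be pickled. The class could easily be made picklable, but the module source would have to be modified. (Just use module-level functions instead of lambdas.)

The aligner is a model because it learns the probability of every target-language word given a source-language word. Although that could be stored as one big hash table, the author of the code decided to store it as a lambda function returning a defaultdict. Given the learned probabilities, it is possible to assign probabilities to new data, which is why it is called a model. However, I agree with you that saving the model is unnatural, because given new data you can simply rebuild the probability model. By the way, nltk.align is not broken in Python 3. Which version are you using?

I see.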
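The picklability point raised in these comments can be checked with the standard library alone. The sketch below is not NLTK's code: the table and the floor value MIN_PROB are stand-ins assumed for illustration. It shows that a defaultdict with a lambda default factory is rejected by plain pickle, and that a plain dict snapshot of the same data (the "big hash table" mentioned above) can be saved and re-wrapped after loading.

```python
import pickle
from collections import defaultdict

MIN_PROB = 1e-12  # hypothetical floor value, chosen only for this sketch

# A table shaped like the one described above: defaultdict + lambda factory.
table = defaultdict(lambda: MIN_PROB)
table["haus"] = 0.7

try:
    pickle.dumps(table)
    lambda_pickled = True
except (pickle.PicklingError, AttributeError, TypeError):
    # pickle cannot look the lambda default factory up by name, so it fails.
    lambda_pickled = False

# Workaround: persist a plain dict snapshot, then re-wrap it after loading.
payload = pickle.dumps(dict(table))
restored = defaultdict(lambda: MIN_PROB, pickle.loads(payload))

print(lambda_pickled, restored["haus"], restored["unseen"])
```

Replacing the lambdas with module-level functions, as suggested in the comments, is the other route, but it requires editing the module source; the snapshot approach and dill both leave the library untouched.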
