NLP: How to tune a machine translation model with a huge language model?


Moses is a piece of software for building machine translation models, and KenLM is the actual language model software that Moses uses.

I have a 16GB text file which I use to build a language model:

bin/lmplz -o 5 <text > text.arpa
bin/build_binary text.arpa text.binary
The binary language model (text.binary) grows to 71GB.
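As a side note (my own suggestion, not something from the question): build_binary can also build a quantized trie instead of the default probing hash table, which usually shrinks the binary considerably at a small cost in lookup speed and precision. A sketch, assuming the same text.arpa as above:

# trie data structure with 8-bit quantized probabilities (-q) and backoffs (-b);
# typically much smaller on disk than the default probing model
bin/build_binary -q 8 -b 8 trie text.arpa text.trie.binary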

In Moses, after training the translation model, you should tune the model weights using the MERT algorithm. This is simple enough to run.

MERT works fine with a small language model, but with a large language model it takes a considerable amount of time to finish.
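For context, tuning is typically launched with Moses' mert-moses.pl wrapper; a minimal sketch, where dev.input, dev.ref, and the various paths are assumptions for illustration rather than details from the question:

# run MERT tuning: held-out source (dev.input) and reference (dev.ref),
# the decoder binary, and the moses.ini produced by training
~/moses/scripts/training/mert-moses.pl \
    dev.input dev.ref \
    ~/moses/bin/moses model/moses.ini \
    --mertdir ~/moses/bin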

I googled around and found KenLM's filter utility, which promises to filter a language model down to a smaller size:

But I have no idea how to make it work. The command's help gives:

$ ~/moses/bin/filter
Usage: /home/alvas/moses/bin/filter mode [context] [phrase] [raw|arpa] [threads:m] [batch_size:m] (vocab|model):input_file output_file

copy mode just copies, but makes the format nicer for e.g. irstlm's broken
    parser.
single mode treats the entire input as a single sentence.
multiple mode filters to multiple sentences in parallel.  Each sentence is on
    a separate line.  A separate file is created for each sentence by appending
    the 0-indexed line number to the output file name.
union mode produces one filtered model that is the union of models created by
    multiple mode.

context means only the context (all but last word) has to pass the filter, but
    the entire n-gram is output.

phrase means that the vocabulary is actually tab-delimited phrases and that the
    phrases can generate the n-gram when assembled in arbitrary order and
    clipped.  Currently works with multiple or union mode.

The file format is set by [raw|arpa] with default arpa:
raw means space-separated tokens, optionally followed by a tab and arbitrary
    text.  This is useful for ngram count files.
arpa means the ARPA file format for n-gram language models.

threads:m sets m threads (default: conccurrency detected by boost)
batch_size:m sets the batch size for threading.  Expect memory usage from this
    of 2*threads*batch_size n-grams.

There are two inputs: vocabulary and model.  Either may be given as a file
    while the other is on stdin.  Specify the type given as a file using
    vocab: or model: before the file name.  

For ARPA format, the output must be seekable.  For raw format, it can be a
    stream i.e. /dev/stdout
But when I try the following, it gets stuck and does nothing:

$ ~/moses/bin/filter union lm.en.binary lm.filter.binary
Assuming that lm.en.binary is a model file
Reading lm.en.binary
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
What should be done to the language model after binarization? Is there any other way to manipulate a large language model to reduce the computing load while tuning?

What is the usual way to tune with a large LM file?

How do I use KenLM's filter?


(more details)

Answering how to use KenLM's filter:

cat small_vocabulary_one_word_per_line.txt \
  | filter single \
         "model:LM_large_vocab.arpa" \
          output_LM_small_vocab.

Note: single can be replaced with union or copy. Read more in the help that is printed when you run the filter binary without arguments.

Comment: Are you sure it is the language model that makes MERT so slow? I am quite new to SMT, but for some reason I would expect the size of the translation model to be a bigger problem. That can be reduced with training/filter-model-given-input.pl.
Reply: Yes, it is the large language model that makes MERT slow. I have tried LMs of various sizes.
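Putting the answer together with the question, a fuller sketch of the same workflow. File names such as dev.input and lm.en.arpa are assumptions for illustration; note that the help above only lists raw and arpa input formats, so the filter is run on the ARPA file rather than on the 71GB binary:

# build a one-word-per-line vocabulary from the tuning set (hypothetical dev.input)
tr ' ' '\n' < dev.input | sort -u > dev.vocab

# filter the big ARPA model down to that vocabulary; the vocabulary arrives
# on stdin because the model is the input that was given as a file
~/moses/bin/filter single model:lm.en.arpa lm.filtered.arpa < dev.vocab

# binarize the much smaller filtered model for use during tuning
~/moses/bin/build_binary lm.filtered.arpa lm.filtered.binary

This would also suggest why the earlier filter union lm.en.binary lm.filter.binary invocation appeared stuck: with both positional arguments consumed as model and output, filter waits for a vocabulary on stdin (and is in any case being handed a binary rather than an ARPA file), which may be why it seems to hang.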