Sentencepiece in Google Colab

Tags: google-colaboratory, tokenize, machine-translation, opennmt, sentencepiece

I want to use sentencepiece in a Google Colab project where I am training an OpenNMT model. I'm a bit confused about how to set up the sentencepiece binaries in Google Colab. Do I need to build it with cmake?
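For reference, the PyPI package ships prebuilt wheels, so a plain pip install is normally enough for the Python bindings in a Colab runtime (no cmake build needed); a minimal sanity-check cell might look like this:

!pip install sentencepiece

import sentencepiece as spm
print(spm.__version__)  # confirms the module imports in this runtime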

When I install it with
pip install sentencepiece
and try to include sentencepiece in the transforms in my script, I get the following error.

After running this script (taken from the OpenNMT translation tutorial):
!onmt_build_vocab -config en-sp.yaml -n_sample -1

I get:

Traceback (most recent call last):
  File "/usr/local/bin/onmt_build_vocab", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 63, in main
    build_vocab_main(opts)
  File "/usr/local/lib/python3.7/dist-packages/onmt/bin/build_vocab.py", line 32, in build_vocab_main
    transforms = make_transforms(opts, transforms_cls, fields)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/transform.py", line 176, in make_transforms
    transform_obj.warm_up(vocabs)
  File "/usr/local/lib/python3.7/dist-packages/onmt/transforms/tokenize.py", line 110, in warm_up
    load_src_model.Load(self.src_subword_model)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 367, in Load
    return self.LoadFromFile(model_file)
  File "/usr/local/lib/python3.7/dist-packages/sentencepiece/__init__.py", line 171, in LoadFromFile
    return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
TypeError: not a string
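The last two frames give a hint: LoadFromFile is handed self.src_subword_model, and "not a string" typically means that value is None, i.e. the config never points the sentencepiece transform at a trained subword model (in OpenNMT-py these paths are usually supplied via src_subword_model and tgt_subword_model). The error is easy to reproduce in isolation:

import sentencepiece as spm

sp = spm.SentencePieceProcessor()
# Passing None (an unset subword model path) raises the same error as above:
sp.Load(None)  # TypeError: not a string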

Edit: So I kept googling this problem and found a Google Colab project here that builds sentencepiece with cmake. However, even after building it with cmake, I still ran into this issue.

To fix this, I had to filter and tokenize my dataset and then train sentencepiece on it. I used the scripts from this helpful source to do everything, and now my model is training.
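A minimal sketch of that sentencepiece training step, using the corpus file names from the config below (the model prefixes and vocab size are arbitrary choices, not taken from the tutorial):

import sentencepiece as spm

# Train one subword model per language on the filtered corpus files.
for lang, corpus in [("es", "train_europarl-v7.es-en.es"),
                     ("en", "train_europarl-v7.es-en.en")]:
    spm.SentencePieceTrainer.train(
        input=corpus,
        model_prefix=f"spm_{lang}",   # writes spm_es.model / spm_en.model (+ .vocab)
        vocab_size=32000,
        character_coverage=1.0,
    )

The resulting .model files are what the sentencepiece transform needs to load. For reference, the config file then looks like this: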

## Where the samples will be written
save_data: en-sp/run/example

## Where the vocab(s) will be written
src_vocab: en-sp/run/example.vocab.src
tgt_vocab: en-sp/run/example.vocab.tgt

## Where the model will be saved
save_model: drive/MyDrive/Europarl/model/model

# Prevent overwriting existing files in the folder
overwrite: False

# Corpus opts:
data:
    europarl:
        path_src: train_europarl-v7.es-en.es
        path_tgt: train_europarl-v7.es-en.en
        transforms: [sentencepiece, filtertoolong]
        weight: 1

    valid:
        path_src: dev_europarl-v7.es-en.es
        path_tgt: dev_europarl-v7.es-en.en
        transforms: [sentencepiece]

skip_empty_level: silent

world_size: 1
gpu_ranks: [0]
...
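With the subword models in place and the config saved as en-sp.yaml (matching the command earlier), the usual next steps from the OpenNMT-py quickstart would be to rebuild the vocab and launch training:

!onmt_build_vocab -config en-sp.yaml -n_sample -1
!onmt_train -config en-sp.yaml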