How do I build a translation corpus for Python NLTK?


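I have been using Python's NLTK for general language parsing, and recently I wanted to create a corpus specifically for translation. I have had trouble understanding the corpus options and structures that NLTK offers for translation.

There are many corpora, but I can't find any details on creating a translation-style corpus. From browsing the corpus reference I understand that there are several styles and types, but I can't seem to find any corpus examples or documentation that are specific to translation.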

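For translation-like datasets, NLTK can read corpora of word-aligned sentences with AlignedCorpusReader. The files must have the following format: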

first source sentence
first target sentence 
first alignment
second source sentence
second target sentence
second alignment

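This means that tokens are assumed to be separated by whitespace and that each sentence starts on its own line. For example, suppose you have the following directory structure: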

reader.py
data/en-es.txt
data/en-pt.txt

where the files contain:

# en-es.txt
This is an example
Esto es un ejemplo
0-0 1-1 2-2 3-3

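and

# en-pt.txt
This is an example
Esto é um exemplo
0-0 1-1 2-2 3-3

You can load this toy example with the following script: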

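# reader.py
from nltk.corpus.reader.aligned import AlignedCorpusReader

# The second argument is a regular expression over file names;
# this one matches every file in ./data that ends in .txt.
reader = AlignedCorpusReader('./data', r'.*\.txt', encoding='utf-8')

for sentence in reader.aligned_sents():
    print(sentence.words)      # source-language tokens
    print(sentence.mots)       # target-language tokens
    print(sentence.alignment)  # word alignment, e.g. 0-0 1-1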

Output:

['This', 'is', 'an', 'example']
['Esto', 'es', 'un', 'ejemplo']
0-0 1-1 2-2 3-3
['This', 'is', 'an', 'example']
['Esto', 'é', 'um', 'exemplo']
0-0 1-1 2-2 3-3
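
To make the alignment line in the output concrete, here is a minimal sketch in plain Python (it does not use NLTK) that pairs up the words an alignment string refers to. The helper names parse_alignment and aligned_word_pairs are hypothetical and exist only for this illustration; the sketch assumes each i-j entry means "source token i is aligned with target token j", counting from zero.

```python
def parse_alignment(alignment_line):
    """Turn a string such as '0-0 1-1' into a list of (source, target) index pairs."""
    pairs = []
    for item in alignment_line.split():
        source_index, target_index = item.split("-")
        pairs.append((int(source_index), int(target_index)))
    return pairs


def aligned_word_pairs(source_line, target_line, alignment_line):
    """Return the (source word, target word) pairs named by the alignment."""
    words = source_line.split()   # tokens are whitespace-separated
    mots = target_line.split()
    return [(words[i], mots[j]) for i, j in parse_alignment(alignment_line)]


pairs = aligned_word_pairs(
    "This is an example",
    "Esto es un ejemplo",
    "0-0 1-1 2-2 3-3",
)
print(pairs)
# [('This', 'Esto'), ('is', 'es'), ('an', 'un'), ('example', 'ejemplo')]
```
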

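The line that constructs the reader creates an instance of AlignedCorpusReader that reads every file in the './data' directory whose name ends in '.txt'. It also specifies that the files are encoded as 'utf-8'. Other parameters of AlignedCorpusReader include word_tokenizer and sent_tokenizer; by default, word_tokenizer is set to WhitespaceTokenizer() and sent_tokenizer is set to RegexpTokenizer('\n', gaps=True).

More information can be found in the NLTK documentation.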
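
If you are building the corpus files yourself, a small writer can produce the three-line format described above. The following is only a sketch under stated assumptions: the function names format_aligned_corpus and write_aligned_corpus are hypothetical, the input is assumed to be already tokenized, and alignments are assumed to be lists of zero-based (source index, target index) pairs. It uses only the standard library.

```python
import tempfile
from pathlib import Path


def format_aligned_corpus(triples):
    """Render (source_tokens, target_tokens, alignment) triples as text.

    Each triple becomes three lines: source sentence, target sentence,
    and the alignment written as space-separated i-j pairs.
    """
    lines = []
    for source_tokens, target_tokens, alignment in triples:
        lines.append(" ".join(source_tokens))
        lines.append(" ".join(target_tokens))
        lines.append(" ".join(f"{i}-{j}" for i, j in alignment))
    return "\n".join(lines) + "\n"


def write_aligned_corpus(path, triples):
    """Write the rendered corpus to path as UTF-8, creating parent folders."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(format_aligned_corpus(triples), encoding="utf-8")


example = [
    (
        ["This", "is", "an", "example"],
        ["Esto", "es", "un", "ejemplo"],
        [(0, 0), (1, 1), (2, 2), (3, 3)],
    ),
]

# Write into a temporary directory so the demo leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "data" / "en-es.txt"
    write_aligned_corpus(target, example)
    print(target.read_text(encoding="utf-8"), end="")
```

Pointing a reader like the one in reader.py at the folder that holds such files should then load them, provided the encoding passed to the reader matches the one used when writing.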

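Thank you. As a programmer trying to help someone build a model for an uncommon language, this is a good start for me in learning the terminology and techniques.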
