NLP: Shuffling line pairs in two text files


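I am working on a machine translation project in which I have 4.5 million lines of text in two languages. Before splitting the data into shards and training my model on them, I want to shuffle the lines. I know that the `shuf` command shuffles the lines of a single file, but how can I make sure that the corresponding lines in the second file end up in the same shuffled order? Is there a command that shuffles the lines of two files together?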

TL;DR

- `paste` to join the two files as separate columns in a single file
- `shuf` on the single file
- `cut` to split the columns apart again

Paste

$ cat test.en 
a b c
d e f
g h i

$ cat test.de 
1 2 3
4 5 6
7 8 9

$ paste test.en test.de > test.en-de

$ cat test.en-de
a b c   1 2 3
d e f   4 5 6
g h i   7 8 9
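
Before pasting a real corpus, it may be worth checking two things that the toy example takes for granted: that both files have the same number of lines, and that no sentence contains a literal tab, since `paste` uses a tab as the column separator and `cut` splits on it later. The following is only a sketch under those assumptions; the sample data is recreated so the snippet runs on its own, and the error messages are illustrative.

```shell
#!/bin/sh
# Sample data mirroring the example above (overwrites test.en / test.de here).
printf 'a b c\nd e f\ng h i\n' > test.en
printf '1 2 3\n4 5 6\n7 8 9\n' > test.de

# paste pads the shorter file with empty fields, so a length mismatch
# would otherwise go unnoticed.
en_lines=$(wc -l < test.en)
de_lines=$(wc -l < test.de)
if [ "$en_lines" -ne "$de_lines" ]; then
    echo "line counts differ: $en_lines vs $de_lines" >&2
    exit 1
fi

# A literal tab inside a sentence would shift the columns that cut sees later.
tab=$(printf '\t')
if grep -q "$tab" test.en test.de; then
    echo "input contains tab characters" >&2
    exit 1
fi

# Both checks passed: join the files line by line, separated by a tab.
paste test.en test.de > test.en-de
```

If the tab check fails, one option is to pick a different separator with `paste -d` and `cut -d`, provided that character never occurs in the data.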


Shuffle

$ shuf test.en-de > test.en-de.shuf

$ cat test.en-de.shuf
d e f   4 5 6
a b c   1 2 3
g h i   7 8 9
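
The paste and shuffle steps can also be chained in a pipeline, which avoids writing the intermediate joined file. The sketch below does that and then checks that the shuffled file is only a reordering of the original pairs. The file names `pairs.sorted` and `shuf.sorted` are made up for this illustration, and `shuf` is assumed to be the GNU coreutils version.

```shell
#!/bin/sh
# Sample data mirroring the example above.
printf 'a b c\nd e f\ng h i\n' > test.en
printf '1 2 3\n4 5 6\n7 8 9\n' > test.de

# Join and shuffle in one pipeline; no test.en-de file is written.
paste test.en test.de | shuf > test.en-de.shuf

# A shuffle only reorders lines, so both sides must sort to the same content.
paste test.en test.de | sort > pairs.sorted
sort test.en-de.shuf > shuf.sorted
if cmp -s pairs.sorted shuf.sorted; then
    echo "shuffle kept every pair intact"
else
    echo "shuffled file differs from the input pairs" >&2
    exit 1
fi
```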

Cut

$ cut -f1 test.en-de.shuf > test.en-de.shuf.en
$ cut -f2 test.en-de.shuf > test.en-de.shuf.de

$ cat test.en-de.shuf.en 
d e f
a b c
g h i

$ cat test.en-de.shuf.de
4 5 6
1 2 3
7 8 9
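
Since the question mentions splitting the data into shards afterwards, here is one possible way to do that with `split`, sketched on the same toy data. The `shard.en.` and `shard.de.` prefixes and the shard size of 2 are hypothetical choices for illustration. Using the same `-l` value for both languages keeps each shard of one language aligned with the matching shard of the other.

```shell
#!/bin/sh
# Sample data mirroring the example above.
printf 'a b c\nd e f\ng h i\n' > test.en
printf '1 2 3\n4 5 6\n7 8 9\n' > test.de

# Paste, shuffle, and cut as described in the answer.
paste test.en test.de | shuf > test.en-de.shuf
cut -f1 test.en-de.shuf > test.en-de.shuf.en
cut -f2 test.en-de.shuf > test.en-de.shuf.de

# Hypothetical shard size; a real corpus would use a far larger value.
lines_per_shard=2

# split names its outputs <prefix>aa, <prefix>ab, ... in order, so
# shard.en.aa pairs with shard.de.aa, shard.en.ab with shard.de.ab, etc.
split -l "$lines_per_shard" test.en-de.shuf.en shard.en.
split -l "$lines_per_shard" test.en-de.shuf.de shard.de.

ls shard.en.* shard.de.*
```

With three lines and a shard size of 2, each language ends up with one two-line shard and one one-line shard.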