Unix: compare two files based on two columns, but keep both duplicate lines with a pattern
Tags: unix, awk
File 1:
scaffold2232_size19577 gene 8878 9258
scaffold2232_size19577 CDS 8878 9258
scaffold2232_size19577 gene 10631 14562
scaffold2232_size19577 intron 10693 11242
scaffold2232_size19577 intron 11343 14252
scaffold2232_size19577 intron 14346 14499
scaffold2232_size19577 CDS 10631 10692
scaffold2232_size19577 CDS 11243 11342
scaffold2232_size19577 CDS 14253 14345
scaffold2232_size19577 CDS 14500 14562
scaffold2232_size19577 gene 18807 19055
scaffold2232_size19577 CDS 18807 19055
File 2:
scaffold2232_size19577 8878 9258 Os12t0508300-01
scaffold2232_size19577 8878 9258 Os12t0508300-01
scaffold2232_size19577 10631 14562 Os12t0508300-01
scaffold2232_size19577 10693 11242 Os12t0508300-01
scaffold2232_size19577 11343 14252 Os12t0508300-01
scaffold2232_size19577 14346 14499 Os12t0508400-00
scaffold2232_size19577 14346 14499 Os12t0508400-00
scaffold2232_size19577 14346 14499 Os12t0508400-00
scaffold2232_size19577 10631 10692 Os12t0508300-01
scaffold2232_size19577 11243 11342 Os12t0508300-01
scaffold2232_size19577 14253 14345 Os12t0508400-00
scaffold2232_size19577 14253 14345 Os12t0508400-00
scaffold2232_size19577 14253 14345 Os12t0508400-00
scaffold2232_size19577 14500 14562 Os12t0508400-00
scaffold2232_size19577 14500 14562 Os12t0508400-00
scaffold2232_size19577 14500 14562 Os12t0508400-00
scaffold2232_size19577 18807 19055 Os12t0508400-00
scaffold2232_size19577 18807 19055 Os12t0508400-00
scaffold2232_size19577 18807 19055 Os12t0508400-00
scaffold2232_size19577 18807 19055 Os12t0508400-00
scaffold2232_size19577 18807 19055 Os12t0508400-00
scaffold2232_size19577 18807 19055 Os12t0508400-00
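I tried:
awk '{a[$1,$2,$3]=$0} END{for (i in a) print a[i]}' file2
But with this I lose one line of each gene/CDS pair, because they have the same coordinates in col[2],[3]. What I want is to keep both, like this: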
scaffold2232_size19577 8878 9258 Os12t0508300-01 gene
scaffold2232_size19577 8878 9258 Os12t0508300-01 CDS
scaffold2232_size19577 10631 14562 Os12t0508300-01 gene
scaffold2232_size19577 10693 11242 Os12t0508300-01 intron
scaffold2232_size19577 11343 14252 Os12t0508300-01 intron
scaffold2232_size19577 14346 14499 Os12t0508400-00 intron
scaffold2232_size19577 10631 10692 Os12t0508300-01 CDS
scaffold2232_size19577 11243 11342 Os12t0508300-01 CDS
scaffold2232_size19577 14253 14345 Os12t0508400-00 CDS
scaffold2232_size19577 14500 14562 Os12t0508400-00 CDS
scaffold2232_size19577 18807 19055 Os12t0508400-00 gene
scaffold2232_size19577 18807 19055 Os12t0508400-00 CDS
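I thought I could add col[2] of file1 to file2 afterwards, but after this awk operation the number of lines is reduced, so I cannot add them.
Instead of the output I want, I get something like this: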
scaffold2232_size19577 8878 9258 Os12t0508300-01
scaffold2232_size19577 10631 14562 Os12t0508300-01
scaffold2232_size19577 10693 11242 Os12t0508300-01
scaffold2232_size19577 11343 14252 Os12t0508300-01
scaffold2232_size19577 14346 14499 Os12t0508400-00
scaffold2232_size19577 10631 10692 Os12t0508300-01
scaffold2232_size19577 11243 11342 Os12t0508300-01
scaffold2232_size19577 14253 14345 Os12t0508400-00
scaffold2232_size19577 14500 14562 Os12t0508400-00
scaffold2232_size19577 18807 19055 Os12t0508400-00
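The drop in line count comes from the array key: every row of file2 with the same scaffold, start and end overwrites the same array element, so repeats collapse to one entry. A minimal sketch of that effect, assuming a tab-separated layout and using a hypothetical three-line subset of file2 (the variable names `f2`, `total_count` and `dedup_count` are illustrative only):

```shell
# Hypothetical three-line subset of file2 (tab-separated), with one repeated row.
f2=$(mktemp)
printf 'scaffold2232_size19577\t8878\t9258\tOs12t0508300-01\n'   >  "$f2"
printf 'scaffold2232_size19577\t8878\t9258\tOs12t0508300-01\n'   >> "$f2"
printf 'scaffold2232_size19577\t10631\t14562\tOs12t0508300-01\n' >> "$f2"

# Count the input lines.
total_count=$(awk 'END{print NR}' "$f2")

# Keying on ($1,$2,$3) keeps one element per coordinate pair, so repeats collapse.
dedup_count=$(awk '{a[$1,$2,$3]=$0} END{n=0; for (i in a) n++; print n}' "$f2")

echo "lines in: $total_count, lines out: $dedup_count"
rm -f "$f2"
```

Because the deduplicated result no longer has one line per line of file1, the feature column cannot simply be pasted alongside it afterwards.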
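Is the file tab-separated?
Yes, it is tab-separated.
You don't actually need FS when constructing the composite key. This works fine and looks clean too: awk 'FNR==NR{a[$2,$3]=$4;next}{print $1,$3,$4,a[$3,$4],$2}' OFS="\t" f2 f1
@jaypal I have burned my fingers on bad separation before, and found that FS works well :)
@Jotne The comma in a[$2,$3] actually separates the parts with SUBSEP, a non-printing character that ensures the keys stay safely separated.
@jaypal Thanks for the info. I will keep it in mind at the next crossroads :)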
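The remark about the comma subscript can be checked directly. The following is a small sketch, assuming a POSIX-compatible awk; the two coordinates and the ID are borrowed from the sample data, and the variable name `subsep_check` is illustrative only.

```shell
# a[x,y] is stored under the single string key  x SUBSEP y.
subsep_check=$(awk 'BEGIN {
    a["8878","9258"] = "Os12t0508300-01"
    # The explicit concatenation names the same element as the comma form.
    if (("8878" SUBSEP "9258") in a) print "same key"
    # split() on SUBSEP recovers the two parts of a composite key.
    for (k in a) { n = split(k, part, SUBSEP); print n, part[1], part[2] }
}')
echo "$subsep_check"
```

Joining with FS also works for this data because the fields themselves cannot contain the field separator, which is why both variants in the thread give the same result.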
awk 'FNR==NR {a[$2FS$3]=$4;next} {print $1,$3,$4,a[$3FS$4],$2}' OFS="\t" f2 f1
scaffold2232_size19577 8878 9258 Os12t0508300-01 gene
scaffold2232_size19577 8878 9258 Os12t0508300-01 CDS
scaffold2232_size19577 10631 14562 Os12t0508300-01 gene
scaffold2232_size19577 10693 11242 Os12t0508300-01 intron
scaffold2232_size19577 11343 14252 Os12t0508300-01 intron
scaffold2232_size19577 14346 14499 Os12t0508400-00 intron
scaffold2232_size19577 10631 10692 Os12t0508300-01 CDS
scaffold2232_size19577 11243 11342 Os12t0508300-01 CDS
scaffold2232_size19577 14253 14345 Os12t0508400-00 CDS
scaffold2232_size19577 14500 14562 Os12t0508400-00 CDS
scaffold2232_size19577 18807 19055 Os12t0508400-00 gene
scaffold2232_size19577 18807 19055 Os12t0508400-00 CDS
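For readers unfamiliar with the `FNR==NR` idiom, the one-liner can be spelled out with comments. This is only a sketch under stated assumptions: the inputs are tab-separated, the two temporary files are hypothetical two-line subsets of the question's data, and the `NA` fallback for coordinates missing from file2 is an addition that is not part of the original answer.

```shell
f1=$(mktemp); f2=$(mktemp)
# file1 subset: scaffold, feature type, start, end
printf 'scaffold2232_size19577\tgene\t8878\t9258\n' >  "$f1"
printf 'scaffold2232_size19577\tCDS\t8878\t9258\n'  >> "$f1"
# file2 subset: scaffold, start, end, ID (repeated rows, as in the question)
printf 'scaffold2232_size19577\t8878\t9258\tOs12t0508300-01\n' >  "$f2"
printf 'scaffold2232_size19577\t8878\t9258\tOs12t0508300-01\n' >> "$f2"

joined=$(awk -F'\t' -v OFS='\t' '
    # First file on the command line (file2): remember the ID per start/end pair.
    FNR == NR { id[$2, $3] = $4; next }
    # Second file (file1): one output line per input line, so gene and CDS both survive.
    { print $1, $3, $4, (($3, $4) in id ? id[$3, $4] : "NA"), $2 }
' "$f2" "$f1")
echo "$joined"
rm -f "$f1" "$f2"
```

Repeated rows in file2 do no harm here, since they only rewrite the same array element, while the output is driven line by line from file1. If the same coordinates could occur on different scaffolds, adding `$1` to the key would be a natural extension.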