Unix: comparing two files based on two columns, but keeping both duplicate lines with the pattern

Tags: unix, awk

File 1:

scaffold2232_size19577   gene       8878    9258
scaffold2232_size19577   CDS        8878    9258
scaffold2232_size19577   gene       10631   14562
scaffold2232_size19577   intron     10693   11242
scaffold2232_size19577   intron     11343   14252
scaffold2232_size19577   intron     14346   14499
scaffold2232_size19577   CDS        10631   10692
scaffold2232_size19577   CDS        11243   11342
scaffold2232_size19577   CDS        14253   14345
scaffold2232_size19577   CDS        14500   14562
scaffold2232_size19577   gene       18807   19055
scaffold2232_size19577   CDS        18807   19055

File 2:

scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   8878   9258    Os12t0508300-01
scaffold2232_size19577   10631  14562   Os12t0508300-01
scaffold2232_size19577   10693  11242   Os12t0508300-01
scaffold2232_size19577   11343  14252   Os12t0508300-01
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   10631  10692   Os12t0508300-01
scaffold2232_size19577   11243  11342   Os12t0508300-01
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00

Expected output (this is what I want):

scaffold2232_size19577   8878   9258    Os12t0508300-01 gene
scaffold2232_size19577   8878   9258    Os12t0508300-01 CDS
scaffold2232_size19577   10631  14562   Os12t0508300-01 gene
scaffold2232_size19577   10693  11242   Os12t0508300-01 intron
scaffold2232_size19577   11343  14252   Os12t0508300-01 intron
scaffold2232_size19577   14346  14499   Os12t0508400-00 intron
scaffold2232_size19577   10631  10692   Os12t0508300-01 CDS
scaffold2232_size19577   11243  11342   Os12t0508300-01 CDS
scaffold2232_size19577   14253  14345   Os12t0508400-00 CDS
scaffold2232_size19577   14500  14562   Os12t0508400-00 CDS
scaffold2232_size19577   18807  19055   Os12t0508400-00 gene
scaffold2232_size19577   18807  19055   Os12t0508400-00 CDS

I tried:

awk '{a[$1,$2,$3]=$0} END{for (i in a) print a[i]}' file2

But with this I lose one line of each gene/CDS pair, because the two lines have the same coordinates in columns 2 and 3. I thought I could add column 2 of file1 to file2 afterwards, but this awk operation reduces the number of lines, so I can no longer add them. The output comes out something like this:

scaffold2232_size19577    8878  9258    Os12t0508300-01 
scaffold2232_size19577   10631  14562   Os12t0508300-01 
scaffold2232_size19577   10693  11242   Os12t0508300-01
scaffold2232_size19577   11343  14252   Os12t0508300-01
scaffold2232_size19577   14346  14499   Os12t0508400-00
scaffold2232_size19577   10631  10692   Os12t0508300-01
scaffold2232_size19577   11243  11342   Os12t0508300-01
scaffold2232_size19577   14253  14345   Os12t0508400-00
scaffold2232_size19577   14500  14562   Os12t0508400-00
scaffold2232_size19577   18807  19055   Os12t0508400-00
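
The line loss in the attempt can be reproduced on a tiny input. An awk array holds a single value per key, so `a[$1,$2,$3]=$0` overwrites the stored line every time the same three fields come up again. The sketch below uses three lines taken from file2; the variable names (`input`, `deduped`, `n_in`, `n_out`) are only illustrative.

```shell
# Three tab-separated lines; the first two share the same key fields ($1,$2,$3).
input=$(printf 'scaffold2232_size19577\t8878\t9258\tOs12t0508300-01\nscaffold2232_size19577\t8878\t9258\tOs12t0508300-01\nscaffold2232_size19577\t10631\t14562\tOs12t0508300-01\n')

# Same logic as the attempt: one array slot per key, so later lines overwrite earlier ones.
deduped=$(printf '%s\n' "$input" | awk '{a[$1,$2,$3]=$0} END{for (i in a) print a[i]}')

# Count the lines going in and the lines that survive.
n_in=$(printf '%s\n' "$input" | awk 'END{print NR}')
n_out=$(printf '%s\n' "$deduped" | awk 'END{print NR}')
echo "input lines: $n_in, output lines: $n_out"
```

This prints `input lines: 3, output lines: 2`. Note also that `for (i in a)` visits keys in an unspecified order, so the surviving lines are not guaranteed to come out in input order.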

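Are the files tab separated?
Yes, they are tab separated.
You don't actually need FS when building a composite key. This works fine and looks clean too:
awk 'FNR==NR{a[$2,$3]=$4;next}{print $1,$3,$4,a[$3,$4],$2}' OFS="\t" f2 f1
@jaypal I have burned my fingers on bad separation before and found that FS works well :)
@Jotne awk actually uses SUBSEP for the separation, a non-printing character that keeps the parts of a multi-part key safely apart.
@jaypal Thanks for the info. I will look at that at the next crossroads :)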
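
The SUBSEP behaviour mentioned in the comments can be checked directly. In awk, a subscript written as `a[x,y]` is stored under the single string `x SUBSEP y`. The sketch below uses made-up keys to show that two subscripts which would collide under plain concatenation stay distinct, and that joining the parts with SUBSEP by hand finds the same element. The shell variable `subsep_check` is illustrative.

```shell
subsep_check=$(awk 'BEGIN {
    # Without a separator both subscripts would collapse to "abc".
    a["ab", "c"] = 1
    a["a", "bc"] = 2

    # Count the distinct keys actually stored.
    n = 0
    for (k in a) n++

    # The comma form is equivalent to joining the parts with SUBSEP.
    key = "ab" SUBSEP "c"
    same = (key in a) ? "yes" : "no"

    print n, same
}')
echo "$subsep_check"
```

This prints `2 yes`. Joining with FS, as in `a[$2 FS $3]`, is just as safe for this data, since a field produced by splitting on FS should not itself contain the separator.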
awk 'FNR==NR {a[$2FS$3]=$4;next} {print $1,$3,$4,a[$3FS$4],$2}' OFS="\t" f2 f1
scaffold2232_size19577  8878    9258    Os12t0508300-01 gene
scaffold2232_size19577  8878    9258    Os12t0508300-01 CDS
scaffold2232_size19577  10631   14562   Os12t0508300-01 gene
scaffold2232_size19577  10693   11242   Os12t0508300-01 intron
scaffold2232_size19577  11343   14252   Os12t0508300-01 intron
scaffold2232_size19577  14346   14499   Os12t0508400-00 intron
scaffold2232_size19577  10631   10692   Os12t0508300-01 CDS
scaffold2232_size19577  11243   11342   Os12t0508300-01 CDS
scaffold2232_size19577  14253   14345   Os12t0508400-00 CDS
scaffold2232_size19577  14500   14562   Os12t0508400-00 CDS
scaffold2232_size19577  18807   19055   Os12t0508400-00 gene
scaffold2232_size19577  18807   19055   Os12t0508400-00 CDS
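
For readers new to the `FNR==NR` idiom, here is a sketch of the same lookup spread over several lines with comments, run on a few lines of the sample data. It uses the comma (SUBSEP) form of the key; the temporary directory, the array name `id`, and the variable `joined` are assumptions made for the example.

```shell
tmp=$(mktemp -d)

# f2: scaffold, start, end, transcript ID (repeated lines are fine)
printf 'scaffold2232_size19577\t8878\t9258\tOs12t0508300-01\nscaffold2232_size19577\t8878\t9258\tOs12t0508300-01\nscaffold2232_size19577\t14346\t14499\tOs12t0508400-00\n' > "$tmp/f2"

# f1: scaffold, feature type, start, end
printf 'scaffold2232_size19577\tgene\t8878\t9258\nscaffold2232_size19577\tCDS\t8878\t9258\nscaffold2232_size19577\tintron\t14346\t14499\n' > "$tmp/f1"

joined=$(awk -F'\t' -v OFS='\t' '
    # First file (f2): FNR equals NR only while the first file is being read.
    # Remember the ID for each (start, end) pair, then skip to the next line.
    FNR == NR { id[$2, $3] = $4; next }

    # Second file (f1): print one line per f1 line, so gene and CDS both survive.
    { print $1, $3, $4, id[$3, $4], $2 }
' "$tmp/f2" "$tmp/f1")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

Because the output is driven by the lines of f1 rather than by the array, a gene line and a CDS line with identical coordinates are both printed. If the files covered more than one scaffold, adding `$1` to the key (stored as `id[$1,$2,$3]`, looked up as `id[$1,$3,$4]`) would avoid matching coordinates across scaffolds.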