awk to update a file based on lines that match a split field
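In the awk below, I am trying to match $2 in file1 with $4 in file2 up to the first underscore. If a match is found, that portion of file2 is updated to the matching $1 value from file1. I think it is close, but I am not sure how to account for the "." in file1. My real data has thousands of lines, all in the format below, and a match may not always be found. The awk executes as is, but file2 is not updated, I think because the "." does not match. Thank you :)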
file1 (space-delimited)
TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4
file2 (tab-delimited)
chr1 92149295 92149414 NM_003243_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 NM_003243_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 NM_003243_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 NM_003242_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 NM_003242_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f
chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f
Desired output (tab-delimited)
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
awk
awk 'FNR==NR {A[$1]=$1; next} $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2
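The following awk may help you. You can also change the field separator FS for each input file: for example, if Input_file1 is space-delimited, put FS=" " before its name, and if Input_file2 is tab-delimited, put FS="\t" before its name.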
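awk '
FNR==NR{                          # first file: build the lookup table
  val=$2;
  sub(/\..*/,"",val);             # drop the version suffix, e.g. NM_004612.3 -> NM_004612
  a[val]=$1;                      # map accession -> gene name
  next
}
{
  split($4,array,"_")             # split $4 of the second file on underscores
}
((array[1]"_"array[2]) in a){     # key such as NM_004612 is present in a
  sub(/.*_cds/,a[array[1]"_"array[2]]"_cds",$4);   # replace the accession prefix with the gene name
  print                           # only matching lines are printed
}
' Input_file1 Input_file2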
The output is as follows:
chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
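As a sketch of the separator note above, assuming file1 is space-delimited and file2 is tab-delimited: awk applies a command-line assignment such as FS='\t' when it reaches it in the argument list, so each file can be read with its own separator. The OFS='\t' setting (to get tab-delimited output), the key variable, and the temporary file names are assumptions added for illustration and are not part of the posted answer.

```shell
#!/bin/sh
# Hypothetical sample files built from the data in the question.
dir=$(mktemp -d)
printf '%s\n' 'TGFBR1 NM_004612.3' 'TGFBR2 NM_003242.5' 'TGFBR3 NM_003243.4' > "$dir/file1"
printf 'chr1\t92149295\t92149414\tNM_003243_cds_0_0_chr1_92149296_r\n' > "$dir/file2"
printf 'chr9\t101904817\t101904985\tNM_001130916_cds_3_0_chr9_101904818_f\n' >> "$dir/file2"

result=$(awk '
FNR==NR {                         # first file, read with FS=" "
    val = $2
    sub(/\..*/, "", val)          # NM_003243.4 -> NM_003243
    a[val] = $1
    next
}
{
    split($4, array, "_")         # second file, read with FS="\t"
    key = array[1] "_" array[2]
}
(key in a) {
    sub(/.*_cds/, a[key] "_cds", $4)   # assigning to $4 rebuilds $0 with OFS
    print
}
' FS=' ' "$dir/file1" FS='\t' OFS='\t' "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

With this sample input only the NM_003243 line has a key in the lookup table, so only that line is printed, with tabs between its fields.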
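I have posted the solution now. It assumes your field separator is a space (since you have now updated the files), but I also added a note on how to handle files with different delimiters. Let me know.
Works great, you were faster than me while I was editing. What if I do not want to drop the unmatched lines but print all of them instead: would print $0 go after all the other lines? Thank you for your help :).
Does the sub print only the matches, so that printing non-matches as well needs a condition added on array[1]? That is, if there is no match for array[1], the line is printed as is? Thank you :).
@Chris, no, it is ((array[1]"_"array[2]) in a) that checks whether the key exists in array a, and only then is output given.
@Chris, I will also add an explanation to the code later.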
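For the follow-up about keeping unmatched lines, one possible variant is sketched below. It is not the answerer's posted code: it moves the membership test into an if inside the main block and ends with the pattern 1, so every line of the second file is printed and only matching lines have $4 rewritten. The key variable and the temporary file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample files built from the data in the question.
dir=$(mktemp -d)
printf '%s\n' 'TGFBR1 NM_004612.3' 'TGFBR2 NM_003242.5' 'TGFBR3 NM_003243.4' > "$dir/file1"
printf '%s\n' \
  'chr9 101867487 101867584 NM_004612_cds_0_0_chr9_101867488_f' \
  'chr9 101904817 101904985 NM_001130916_cds_3_0_chr9_101904818_f' > "$dir/file2"

result=$(awk '
FNR==NR {                         # first file: accession -> gene name
    val = $2
    sub(/\..*/, "", val)
    a[val] = $1
    next
}
{
    split($4, array, "_")
    key = array[1] "_" array[2]
    if (key in a)                 # rewrite $4 only when the key is known
        sub(/.*_cds/, a[key] "_cds", $4)
}
1                                 # print every line, changed or not
' "$dir/file1" "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

Here the NM_004612 line comes out with TGFBR1 in $4, while the NM_001130916 line, which has no entry in file1, is printed unchanged.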