awk to update a file based on matching lines with split

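In the awk below, I am trying to match $2 in file1 with $4 in file2, up to the first underscore. If a match is found, that portion of file2 is updated with the matching $1 value from file1. I think this is close, but I am not sure how to account for the "." in file1. My real data has thousands of lines, but they are all in the format below, and a match may not always be found. The awk executes as is, but file2 is not updated, I think because of the "." not matching. Thank you :)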

file1 (space-delimited)

TGFBR1 NM_004612.3
TGFBR2 NM_003242.5
TGFBR3 NM_003243.4

file2 (tab-delimited)

chr1    92149295    92149414    NM_003243_cds_0_0_chr1_92149296_r
chr1    92161228    92161336    NM_003243_cds_1_0_chr1_92161229_r
chr1    92163645    92163687    NM_003243_cds_2_0_chr1_92163646_r
chr3    30648375    30648469    NM_003242_cds_0_0_chr3_30648376_f
chr3    30686238    30686407    NM_003242_cds_1_0_chr3_30686239_f
chr9    101867487   101867584   NM_004612_cds_0_0_chr9_101867488_f
chr9    101904817   101904985   NM_001130916_cds_3_0_chr9_101904818_f

Desired output (tab-delimited)

chr1    92149295    92149414    TGFBR3_cds_0_0_chr1_92149296_r
chr1    92161228    92161336    TGFBR3_cds_1_0_chr1_92161229_r
chr1    92163645    92163687    TGFBR3_cds_2_0_chr1_92163646_r
chr3    30648375    30648469    TGFBR2_cds_0_0_chr3_30648376_f
chr3    30686238    30686407    TGFBR2_cds_1_0_chr3_30686239_f
chr9    101867487   101867584   TGFBR1_cds_0_0_chr9_101867488_f
awk

awk 'FNR==NR {A[$1]=$1; next}  $4 in A {sub ($4, $4 "_" A[$4]) }1' OFS='\t' file1 FS='\t' file2

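The following awk may help you. You can also set the field separator FS per input file: if Input_file1 is space-delimited, put FS=" " before its name, and if Input_file2 is tab-delimited, put FS="\t" before its name.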

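awk '
FNR==NR{                        # first file: build the lookup table
  val=$2;
  sub(/\..*/,"",val);           # drop the version suffix, e.g. NM_003243.4 -> NM_003243
  a[val]=$1;                    # map accession -> gene symbol
  next
}
{
  split($4,array,"_")           # second file: split $4 on underscores
}
((array[1]"_"array[2]) in a){   # accession prefix exists in the lookup table
  sub(/.*_cds/,a[array[1]"_"array[2]]"_cds",$4);   # swap the accession for the gene symbol
  print                         # only matching lines are printed
}
'   Input_file1   Input_file2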
The output is as follows:

chr1 92149295 92149414 TGFBR3_cds_0_0_chr1_92149296_r
chr1 92161228 92161336 TGFBR3_cds_1_0_chr1_92161229_r
chr1 92163645 92163687 TGFBR3_cds_2_0_chr1_92163646_r
chr3 30648375 30648469 TGFBR2_cds_0_0_chr3_30648376_f
chr3 30686238 30686407 TGFBR2_cds_1_0_chr3_30686239_f
chr9 101867487 101867584 TGFBR1_cds_0_0_chr9_101867488_f
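
The note about setting FS per file can be tried out with a small sketch like the one below. It is only an illustration under stated assumptions: the temporary directory, the two sample files (a trimmed copy of the question's data) and the variable names p and key are hypothetical, and OFS='\t' is added here so that the rebuilt line stays tab-delimited, which the code above does not do.

```shell
#!/bin/sh
# Hypothetical sample files, trimmed from the data in the question.
dir=$(mktemp -d)
printf '%s\n' 'TGFBR1 NM_004612.3' 'TGFBR2 NM_003242.5' 'TGFBR3 NM_003243.4' > "$dir/file1"
printf 'chr1\t92149295\t92149414\tNM_003243_cds_0_0_chr1_92149296_r\n' > "$dir/file2"
printf 'chr9\t101904817\t101904985\tNM_001130916_cds_3_0_chr9_101904818_f\n' >> "$dir/file2"

# Assignments placed between file names take effect when the next file is
# opened: FS=" " applies to file1, FS and OFS set to a tab apply to file2.
result=$(awk '
FNR==NR { val=$2; sub(/\..*/,"",val); a[val]=$1; next }   # accession -> gene symbol
{ split($4,p,"_"); key=p[1]"_"p[2] }                      # e.g. NM_003243
key in a { sub(/.*_cds/, a[key]"_cds", $4); print }       # print matching lines only
' FS=' ' "$dir/file1" FS='\t' OFS='\t' "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

Because $4 is modified, awk rebuilds the record using OFS, so setting OFS alongside FS is what keeps the tabs in the printed line.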

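I have posted the solution now. It assumes your field separator is a space (since you have now updated the files), but I have also added a note on how to handle files with different delimiters. Let me know.
Works great, you beat me to it while I was editing. What if I don't want to remove the non-matching lines but print all of them instead: would a print $0 be specified after the print, for all lines? Thank you for your help :).
Does the sub only print matches, so that to print both matches and non-matches a condition would have to be added on array[1]? That is, if there is no match in array[1], the line is printed as is? Thank you :).
@Chris, no, it is (array[1]"_"array[2]) that is checked for existence in array a, and only then does it give output.
@Chris, I will add an explanation to the code later as well.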
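
For the follow-up about keeping the non-matching lines, one possible variation is sketched below. It is not the answerer's code: it moves the substitution into an if and prints every line of the second file unconditionally. The temporary directory, the trimmed sample files and the names p and key are hypothetical, and OFS='\t' is an added assumption so that rewritten lines stay tab-delimited.

```shell
#!/bin/sh
# Hypothetical sample files, trimmed from the data in the question.
dir=$(mktemp -d)
printf '%s\n' 'TGFBR1 NM_004612.3' 'TGFBR2 NM_003242.5' 'TGFBR3 NM_003243.4' > "$dir/file1"
printf 'chr1\t92149295\t92149414\tNM_003243_cds_0_0_chr1_92149296_r\n' > "$dir/file2"
printf 'chr9\t101904817\t101904985\tNM_001130916_cds_3_0_chr9_101904818_f\n' >> "$dir/file2"

result=$(awk '
FNR==NR { val=$2; sub(/\..*/,"",val); a[val]=$1; next }   # accession -> gene symbol
{
  split($4,p,"_"); key=p[1]"_"p[2]
  if (key in a) sub(/.*_cds/, a[key]"_cds", $4)   # rewrite only when a match exists
  print                                           # print every line, matched or not
}
' "$dir/file1" FS='\t' OFS='\t' "$dir/file2")

printf '%s\n' "$result"
rm -rf "$dir"
```

Lines without a match are never modified, so they are printed exactly as they appear in the input.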