Awk: arrange data, build a dictionary

I have a tab-delimited file that looks like this:

chr14   106559873       106560782       MA0004.1_Arnt
chr14   106559873       106560782       MA0093.1_USF1
chr14   106559873       106560782       MA0147.1_Myc
chr14   106559873       106560782       RUNX3_DBD_WAACCRCAAWAACCRCAN
chr10   17037867        17038971        MA0080.2_SPI1
chr10   17037867        17038971        MA0152.1_NFATC2
chr17   8610947 8611433 MA0080.2_SPI1
chr17   8610947 8611433 MA0098.1_ETS1
I want to arrange it like this:

Regions   MA0004.1_Arnt  MA0093.1_USF1  MA0147.1_Myc  RUNX3_DBD_WAACCRCAAWAACCRCAN MA0080.2_SPI1 MA0152.1_NFATC2 MA0098.1_ETS1
chr14;106559873;106560782   1 1 1 1 0 0 0
chr10;17037867;17038971     0 0 0 0 1 1 0
chr17;8610947;8611433       0 0 0 0 1 0 1
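The example output shows only the first four lines, but this needs to be applied to the whole file. A 1 indicates that the string is present.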

Since this is an intermediate step in the code I am writing, it is crucial to my analysis. I can no longer work out how to do this in awk.

Thanks.

This awk script will get you most of the way there:

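BEGIN {
    # Header row; the motif order here must match print_fields() below.
    print "Regions   MA0004.1_Arnt  MA0093.1_USF1  MA0147.1_Myc  RUNX3_DBD_WAACCRCAAWAACCRCAN MA0080.2_SPI1 MA0152.1_NFATC2 MA0098.1_ETS1"
    # Start every motif flag at 0.
    a["MA0004.1_Arnt"] = a["MA0093.1_USF1"] = \
    a["MA0147.1_Myc"] = a["RUNX3_DBD_WAACCRCAAWAACCRCAN"] = \
    a["MA0080.2_SPI1"] = a["MA0152.1_NFATC2"] = a["MA0098.1_ETS1"] = 0
}

# Print the current region key followed by one 0/1 flag per motif.
function print_fields () {
    print p";"s";"e, a["MA0004.1_Arnt"], a["MA0093.1_USF1"],
    a["MA0147.1_Myc"], a["RUNX3_DBD_WAACCRCAAWAACCRCAN"],
    a["MA0080.2_SPI1"], a["MA0152.1_NFATC2"], a["MA0098.1_ETS1"]
}

# A new region starts when the chromosome, start, or end changes
# (rows of the same region are assumed to be adjacent):
# flush the previous region and reset the flags.
NR>1 && ($1!=p || $2!=s || $3!=e) {
    print_fields()
    for (i in a) a[i] = 0
}

# Remember the current region and mark its motif as present.
{ p=$1; s=$2; e=$3; a[$4]=1 }

# Flush the last region.
END { print_fields() }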
Test it:

$ awk -f script.awk file
Regions   MA0004.1_Arnt  MA0093.1_USF1  MA0147.1_Myc  RUNX3_DBD_WAACCRCAAWAACCRCAN MA0080.2_SPI1 MA0152.1_NFATC2 MA0098.1_ETS1
chr14;106559873;106560782 1 1 1 1 0 0 0
chr10;17037867;17038971 0 0 0 0 1 1 0
chr17;8610947;8611433 0 0 0 0 1 0 1
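
The script above hard-codes the seven motif names. If the motifs are not known in advance, one option is to collect them while reading and print the whole matrix in the END block. The following is only a minimal sketch under that assumption: the function name motif_matrix and the array names are made up for illustration, columns are separated by single spaces rather than aligned, and the whole table is held in memory until the end of input.

```shell
# Hypothetical helper: build a presence/absence matrix from a file
# with whitespace-separated columns chrom, start, end, motif.
motif_matrix() {
    awk '
    {
        region = $1 ";" $2 ";" $3
        # Remember regions and motifs in order of first appearance.
        if (!(region in seen_region)) {
            seen_region[region] = 1
            regions[++n_regions] = region
        }
        if (!($4 in seen_motif)) {
            seen_motif[$4] = 1
            motifs[++n_motifs] = $4
        }
        present[region, $4] = 1
    }
    END {
        # Header row: "Regions" followed by every motif seen in the input.
        line = "Regions"
        for (j = 1; j <= n_motifs; j++) line = line " " motifs[j]
        print line
        # One row per region: 1 if the motif was seen there, else 0.
        for (i = 1; i <= n_regions; i++) {
            line = regions[i]
            for (j = 1; j <= n_motifs; j++)
                line = line " " (((regions[i], motifs[j]) in present) ? 1 : 0)
            print line
        }
    }' "$@"
}
```

It can be called as `motif_matrix file`, or fed from a pipe. Unlike the streaming script, it does not require the rows of a region to be adjacent, at the cost of keeping every region and motif in memory.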

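1 indicates the presence or absence of a string, such as MA0004.1_Arnt, for a region such as chr14 106559873 106560782: 1 if it is present, 0 otherwise.

You have to read the whole file before you can start printing… even if it is feasible, awk does not seem like the practical choice.

Is the output tab separated?

No. Where should the tabs be?

Try adding OFS="\t" to the BEGIN block. Does that produce the desired output?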
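
The OFS suggestion can be tried in isolation. In awk, the comma-separated arguments of print are joined with OFS, so setting it in BEGIN switches the separator from a space to a tab. A minimal sketch, where the function name region_row_tabbed and the fixed 1 and 0 flags are placeholders for illustration only:

```shell
# Hypothetical helper: print "chrom;start;end" followed by two
# placeholder flags, joined with tabs instead of spaces.
region_row_tabbed() {
    awk '
    BEGIN { OFS = "\t" }                  # separator used between print arguments
    { print $1 ";" $2 ";" $3, 1, 0 }      # commas become tabs; ";" parts are concatenated
    ' "$@"
}
```

Note that OFS only affects values separated by commas in print. The header in the answer's script is printed as one literal string, so its spaces would stay as they are unless the string itself is changed.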