awk: sum values in grouped lines after a specific string match and a header
I have this awk program:
BEGIN {
    FS="[>;]"
    OFS=";"
}
function p(a, i)
{
    for(i in a)
        print ">" i, "*nr=" ln
}
/^>/ {p(out);ln=0;split("",out);next}
/[*]/ {idx=$2 OFS $3; out[idx]}
{ln++}
END {
    if (ln) p(out)
}
It works on the following file:
>Cluster 300
0 151nt, >last238708;size=1... *
>Cluster 301
0 141nt, >last103379;size=1... at -/99.29%
1 151nt, >last104482;size=1... *
>Cluster 302
0 151nt, >last104505;size=1... *
>Cluster 303
0 119nt, >last325860;size=1... at +/99.16%
1 122nt, >last106751;size=1... at +/99.18%
2 151nt, >last284418;size=1... *
3 113nt, >last8067;size=3... at -/100.00%
4 122nt, >last8102;size=3... at -/100.00%
5 135nt, >last14200;size=2... at +/99.26%
>Cluster 304
0 151nt, >last285146;size=1... *
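What I need is for the program to print, for each cluster, the ID (lastxxxxxx) of the line marked with an asterisk, together with the sum of all the "size=" numbers in that cluster. For example, for cluster 303 it must output: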
>last284418;nr=11
And for cluster 304:
>last285146;nr=1
At the moment my code only counts the lines in each cluster; it does not take the "size=" values into account.
Thanks for your help.
Could you please try the following. It was written and tested with the shown samples, in GNU awk only.
awk '
/^>Cluster [0-9]+/{
  if(sum){
    print clus_line ORS val_line" = "sum
  }
  val_line=sum=clus_line=""
  clus_line=$0
  next
}
{
  match($0,/size=[0-9]+/)
  line=substr($0,RSTART,RLENGTH)
  sub(/.*size=/,"",line)
  sum+=line
}
/\*$/{
  match($0,/>last[^;]*/)
  val_line=substr($0,RSTART+1,RLENGTH-1)
}
END{
  if(sum){
    print clus_line ORS val_line" = "sum
  }
}' Input_file
Explanation: a detailed, commented version of the above.
awk '                                    ##Start of the awk program.
/^>Cluster [0-9]+/{                      ##If the line starts with ">Cluster " followed by digits, do the following.
  if(sum){                               ##If sum is non-zero (a previous cluster has been read), do the following.
    print clus_line ORS val_line" = "sum ##Print clus_line, ORS (a newline), val_line, " = " and sum.
  }
  val_line=sum=clus_line=""              ##Reset val_line, sum and clus_line.
  clus_line=$0                           ##Store the current line in clus_line.
  next                                   ##Skip the remaining rules for this line.
}
{
  match($0,/size=[0-9]+/)                ##Use match() to find "size=" followed by digits in the line.
  line=substr($0,RSTART,RLENGTH)         ##Copy the matched text (RLENGTH characters starting at RSTART) into line.
  sub(/.*size=/,"",line)                 ##Remove everything up to and including "size=" from line.
  sum+=line                              ##Add the remaining number to sum.
}
/\*$/{                                   ##If the line ends with "*", do the following.
  match($0,/>last[^;]*/)                 ##Use match() to find ">last" up to, but not including, the next semicolon.
  val_line=substr($0,RSTART+1,RLENGTH-1) ##Store the match without its leading ">" in val_line.
}
END{                                     ##Start of the END block.
  if(sum){                               ##If sum is non-zero, do the following.
    print clus_line ORS val_line" = "sum ##Print the last cluster in the same way.
  }
}' Input_file                            ##Name of the input file.
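The program above prints the cluster header followed by a line of the form "id = sum", whereas the question asks for a single line such as >last284418;nr=11 per cluster. A minimal sketch of a variant that prints that layout is shown below. It assumes that each cluster has one representative line ending in "*", that the ID is the text between ">" and the first ";", and that every member line carries one "size=" value. The names sum_cluster_sizes and emit are illustrative and do not come from the original posts.

```shell
#!/bin/sh
# Sketch: for each cluster print ">ID;nr=SUM", where ID comes from the line
# ending in "*" and SUM is the total of all size= values in that cluster.
# sum_cluster_sizes and emit are hypothetical names chosen for this example.
sum_cluster_sizes() {
  awk '
    # Print the finished cluster (if it had a representative) and reset state.
    function emit() {
      if (id != "") print ">" id ";nr=" sum
      id = ""
      sum = 0
    }
    # A ">Cluster N" header closes the previous cluster.
    /^>Cluster/ { emit(); next }
    # Add the digits that follow "size=" (5 characters) to the running total.
    match($0, /size=[0-9]+/) {
      sum += substr($0, RSTART + 5, RLENGTH - 5)
    }
    # Representative line: ends in "*"; keep the text between ">" and ";".
    /\*[ \t]*$/ && match($0, />[^;]+;/) {
      id = substr($0, RSTART + 1, RLENGTH - 2)
    }
    # The last cluster has no following header, so print it here.
    END { emit() }
  ' "$@"
}
```

It can be called as sum_cluster_sizes Input_file, or fed from a pipe. Only match(), substr(), RSTART and RLENGTH are used, so the sketch is not expected to depend on GNU awk extensions, although that has not been verified here across awk implementations.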