Bash: merging two files while keeping the row with the larger value in a given column (awk)
I have two tab-delimited files:
A 500 50
A 600 30
B 300 100
C 600 40
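and

I want to merge these two files. For rows that match on columns 1 and 2, I want to keep the one with the larger value in column 3.
So the output would be: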
A 500 70
A 600 30
B 300 100
C 600 40
These are samples of the real values:
==> cut125_beng_jointvcf_varcal_geno6.txt <==
scaffold_3015 5910 44.88210969
scaffold_3015 5912 67.86783682
scaffold_3015 5916 79.02675660
scaffold_3015 5926 18.41190163
scaffold_3015 5930 42.07625795
scaffold_3015 5931 52.63549142
scaffold_3015 5954 37.34609103
scaffold_3015 5983 47.36974946
scaffold_3015 5991 41.45881125
==> cut125_wbm_jointvcf_varcal_geno6.txt <==
scaffold_3015 5910 50.79731830
scaffold_3015 5916 146.20529658
scaffold_3015 5926 184.50309487
scaffold_3015 5930 160.27435340
scaffold_3015 5931 172.71907060
scaffold_3015 5954 161.39740159
scaffold_3015 5968 146.54839149
scaffold_3015 5983 97.01874773
scaffold_3015 5991 73.54761456
Could you please try the following:
awk '
FNR==NR{
  a[$1,$2]=$3
  next
}
($1,$2) in a{
  $3=(a[$1,$2]>$3?a[$1,$2]:$3)
  b[$1,$2]
}
1
END{
  for(i in a){
    if(!(i in b)){
      print i,a[i]
    }
  }
}' SUBSEP=" " Input_file1 Input_file2
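This also takes care of rows that are not common to both input files: a key that is present in Input_file1 but not in Input_file2 is still printed, and vice versa.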
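To see that behaviour on concrete data, here is a small sketch that runs the script above on three rows taken from each of the question's real samples. The wrapper name `merge_max` and the temporary file names are my own, chosen only for illustration; the awk program itself is unchanged.

```shell
#!/bin/sh
# merge_max is a hypothetical wrapper name; the awk program is the one shown above.
merge_max() {
    awk '
    FNR==NR { a[$1,$2]=$3; next }                            # first file: remember $3 per ($1,$2) key
    ($1,$2) in a { $3=(a[$1,$2]>$3?a[$1,$2]:$3); b[$1,$2] }  # key in both files: keep the larger $3
    1                                                        # print every line of the second file
    END { for (i in a) if (!(i in b)) print i, a[i] }        # then the keys only the first file had
    ' SUBSEP=" " "$1" "$2"
}

tmp=$(mktemp -d)                 # scratch directory, removed when the script exits
trap 'rm -rf "$tmp"' EXIT

# Three rows from each of the real samples in the question.
printf '%s\n' \
    'scaffold_3015 5910 44.88210969' \
    'scaffold_3015 5912 67.86783682' \
    'scaffold_3015 5916 79.02675660' > "$tmp/beng.txt"
printf '%s\n' \
    'scaffold_3015 5910 50.79731830' \
    'scaffold_3015 5916 146.20529658' \
    'scaffold_3015 5968 146.54839149' > "$tmp/wbm.txt"

merge_max "$tmp/beng.txt" "$tmp/wbm.txt"
```

With gawk or mawk this should print positions 5910, 5916 and 5968 with the wbm values (the larger ones, or the only ones), followed by 5912, which exists only in the beng file. If the real files are tab-delimited, as the question says, note that rebuilt lines come out space-separated while untouched lines keep their tabs; adding `OFS="\t" SUBSEP="\t"` next to the file names should keep the whole output tab-separated.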
Explanation: adding an explanation of the above code too.
awk '
FNR==NR{                               ##FNR==NR is TRUE only while the first file, Input_file1, is being read.
  a[$1,$2]=$3                          ##Store $3 of the current line in array a under the key $1,$2.
  next                                 ##next skips the remaining rules and moves on to the next input line.
}
($1,$2) in a{                          ##For Input_file2 lines: check whether the key $1,$2 was seen in Input_file1.
  $3=(a[$1,$2]>$3?a[$1,$2]:$3)         ##Set $3 to a[$1,$2] if that value is greater than $3, else keep $3.
  b[$1,$2]                             ##Mark the key in array b to track lines common to Input_file1 and Input_file2.
}
1                                      ##1 prints the current line (whether or not $3 was changed).
END{                                   ##END block, run after both files have been read.
  for(i in a){                         ##Loop over every key stored in array a.
    if(!(i in b)){                     ##A key missing from b belongs to an Input_file1 line that has not been printed yet.
      print i,a[i]                     ##Print the key i and its value a[i].
    }
  }
}' SUBSEP=" " Input_file1 Input_file2  ##Set SUBSEP to a space so the key prints as two columns, then name both input files.
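The `SUBSEP=" "` assignment matters because awk stores `a[$1,$2]` under a single string key, `$1 SUBSEP $2`. A minimal sketch of the difference, using one toy row from the question (the function names are mine, purely illustrative):

```shell
#!/bin/sh
# Default SUBSEP is "\034", so the joined key has to be split before it can be printed cleanly.
key_with_default_subsep() {
    awk 'BEGIN {
        a["A", 500] = 50
        for (k in a) { n = split(k, part, SUBSEP); print n, part[1], part[2], a[k] }
    }'
}

# With SUBSEP set to a space, the key already reads like the first two columns.
key_with_space_subsep() {
    awk -v SUBSEP=" " 'BEGIN {
        a["A", 500] = 50
        for (k in a) print k, a[k]
    }'
}

key_with_default_subsep   # prints: 2 A 500 50
key_with_space_subsep     # prints: A 500 50
```

This is why the END block can simply `print i,a[i]`. The sketch uses `-v` because it has no input file; the answer's form, with the assignment placed before the file names, takes effect just before the first file is read, which works equally well there.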
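EDIT: Per the OP, the output lines should keep the order in which they appear in Input_file2 and Input_file1, so I am adding the following solution.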
awk '
FNR==NR{                               ##FNR==NR is TRUE only while the first file is being read.
  a[$1,$2]=$3                          ##Store $3 of the current line in array a under the key $1,$2.
  if(!b[$1,$2]++){                     ##If this key has not been seen before in the first file (tracked in array b):
    d[++count]=$1 OFS $2               ##Record the key in array d at the next index, so d keeps first-seen order.
  }
  next                                 ##next skips the remaining rules and moves on to the next input line.
}
($1,$2) in a{                          ##For second-file lines: check whether the key $1,$2 was seen in the first file.
  $3=(a[$1,$2]>$3?a[$1,$2]:$3)         ##Set $3 to a[$1,$2] if that value is greater than $3, else keep $3.
  c[$1,$2]                             ##Mark the key in array c to track lines common to both files.
}
1                                      ##1 prints the current line (whether or not $3 was changed).
END{                                   ##END block, run after both files have been read.
  for(i=1;i<=count;i++){               ##Loop over array d in the order the keys were first seen.
    if(!(d[i] in c)){                  ##A key missing from c belongs to a first-file line that has not been printed yet.
      print d[i],a[d[i]]               ##Print the key d[i] and its value a[d[i]].
    }
  }
}' SUBSEP=" " FilE1 FilE2              ##Set SUBSEP to a space (it must match OFS here), then name the two input files.
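The reason for the extra array `d` is that awk does not guarantee any particular order for `for (i in a)`, whereas a counter-indexed array can be walked from 1 to `count`. A minimal sketch of that first-seen-order idiom on its own; the function name is hypothetical, and the rows are the question's toy values in an order I picked arbitrarily:

```shell
#!/bin/sh
# first_seen_keys is a hypothetical name; it prints each ($1,$2) key once, in input order.
first_seen_keys() {
    awk '
    !seen[$1,$2]++ { order[++count] = $1 OFS $2 }           # record a key only the first time it appears
    END { for (i = 1; i <= count; i++) print order[i] }     # walk the keys in recorded order
    ' "$@"
}

printf '%s\n' 'C 600 40' 'A 500 50' 'B 300 100' 'A 500 70' | first_seen_keys
```

This prints `C 600`, `A 500` and `B 300`, one per line, with the repeated `A 500` key listed once. In the answer's script the same idea is what keeps the unmatched lines of the first file in their original order.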
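Please edit your question to show what you have tried so far to solve the problem, and tell us where you need help to make progress.

@Ed Morton, sir, could you please tell me whether my code alignment has improved and looks fine now? I would be very grateful.

No doubt about it: that structure is now the one commonly used in all Algol-derived languages, is what any C beautifier would output, and is very easy to read. Thanks for fixing it and for asking! Now, if we could just get you to parenthesize your ternary expressions... :-)

@EdMorton, cool, thanks for the feedback, and my bad about the parentheses. I sometimes forget them, but I have started using them. Honestly, your guidance is always helpful, you rock.

By the way, I haven't voted yet because I'm waiting for the OP to add their attempt to the question before I read it, so I don't know whether your script works, since I don't know what it is supposed to do yet! I'm not even going to think about it until I see some effort in the question. I have actually noticed that you are getting better at parenthesizing ternaries, though. Good job!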