Using a sign to correctly match columns in Awk?
I have two separate files, Input_File1 and Input_File2, each with a different number of columns, which I merge based on the data in several columns. So far, a column has been added to Input_File1 to create a new file (file3) by matching columns 1, 2 and 3 of Input_File1 against columns 1, 2 and 3 of Input_File2. Overall, this works well. However, in some cases the data in columns 1, 2 and 3 are identical in Input_File1 and Input_File2, yet the output in file3 should differ. The difference depends on another feature present in both Input_File1 and Input_File2, namely whether the row carries a "-" or a "+".
Input_File1
VMNF01000007.1 6294425 6294650 . . + Focub_B2_mimp_2
VMNF01000008.1 1441418 1441616 . . - Focub_II5_mimp_3
VMNF01000008.1 1441418 1441616 . . - Focub_B2_mimp_1
VMNF01000008.1 1441418 1441616 . . + Focub_B2_mimp_2
Input_File2
VMNF01000007.1 6294425-6294650(+) tacagtggggggcaataagtatgaataccctttggtgtactgacacacacctctt
VMNF01000008.1 1441418-1441616(-) gggagtgtattgttttttctgccgctagcccattttaacatttagagtgtgcata
VMNF01000008.1 1441418-1441616(-) gggagtgtattgttttttctgccgctagcccattttaacatttagagtgtgcata
VMNF01000008.1 1441418-1441616(+) tacagtggggggcaataagtatgaataccctttgatgtactgacacacacctctt
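As you can see, the last two rows of Input_File2 are identical except for (-) and (+), and the sequences that follow them differ accordingly.
When file3 is generated, the sequences in column 8 do not differ the way they do in Input_File2. This is because only the data VMNF01000008.1 1441418 1441616 is considered when the columns are matched.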
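The ambiguity described above can be made visible with a small check. This is only a sketch: the rows are copied from Input_File2, but the sequences are replaced by the hypothetical placeholders seqA, seqB and seqC to keep it short. It counts how many rows share the same three-part key once the strand sign is ignored.

```shell
# Rows copied from Input_File2; seqA/seqB/seqC are placeholder sequences (assumption).
collisions=$(printf '%s\n' \
  'VMNF01000007.1 6294425-6294650(+) seqA' \
  'VMNF01000008.1 1441418-1441616(-) seqB' \
  'VMNF01000008.1 1441418-1441616(-) seqB' \
  'VMNF01000008.1 1441418-1441616(+) seqC' |
awk '
{
  split($2, coords, "[-(]")              # coords[1] = start, coords[2] = end
  key = $1 OFS coords[1] OFS coords[2]   # same three-part key as in the merge
  ++seen[key]
}
END {
  # Report every key that is shared by more than one row.
  for (key in seen) if (seen[key] > 1) print seen[key], key
}')
echo "$collisions"
```

Here three rows collapse onto the key VMNF01000008.1 1441418 1441616, even though one of them is on the "+" strand and carries a different sequence.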
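Current file3 (note the sequence and the + or - in the last two rows):
file3 should actually look like this (note the sequence and the + or - in the last two rows):
where, as in Input_File2, the sequence differs depending on whether "-" or "+" is present.
So it should work essentially the same way as the previous code, except that a match on "-" or "+" between Input_File1 and Input_File2 is added to make sure the sequence that follows is the correct one. How can "-" or "+" be used in the previous code to decide which sequence should be added in column 8?
This is the code I am using:
Any suggestions? Thanks.
Please try the following.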
awk '
FNR==NR{
split($2,array,"[-(]")
key=$1 OFS array[1] OFS array[2]
++count1[key]
mainarray[key OFS count1[key]]=$NF
next
}
{
key=$1 OFS $2 OFS $3
++count2[key]
}
((key OFS count2[key]) in mainarray){
print $0,mainarray[key OFS count2[key]]
}
' Input_File2 Input_File1
The output is as follows:
VMNF01000007.1 6294425 6294650 . . + Focub_B2_mimp_2 tacagtggggggcaataagtatgaataccctttggtgtactgacacacacctctt
VMNF01000008.1 1441418 1441616 . . - Focub_II5_mimp_3 gggagtgtattgttttttctgccgctagcccattttaacatttagagtgtgcata
VMNF01000008.1 1441418 1441616 . . - Focub_B2_mimp_1 gggagtgtattgttttttctgccgctagcccattttaacatttagagtgtgcata
VMNF01000008.1 1441418 1441616 . . + Focub_B2_mimp_2 tacagtggggggcaataagtatgaataccctttgatgtactgacacacacctctt
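To see what the split() call in the code above produces, the second field can be inspected on its own. The sketch below feeds one minus-strand and one plus-strand value from Input_File2 through the same separator. It also extracts the sign with substr(), which is an addition for illustration and assumes the field always ends in "(+)" or "(-)".

```shell
pieces=$(printf '%s\n' '1441418-1441616(-)' '6294425-6294650(+)' |
awk '
{
  n = split($1, array, "[-(]")             # split on "-" or "("
  strand = substr($1, length($1) - 1, 1)   # character just before the closing ")"
  print n, array[1], array[2], strand
}')
echo "$pieces"
```

On a minus-strand value the "-" inside the parentheses is itself treated as a separator, so the split yields four pieces instead of three. The first two pieces are the start and end coordinates in both cases, which is why the code only uses array[1] and array[2].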
Explanation: a detailed, line-by-line explanation of the above.
awk ' ##Start of the awk program.
FNR==NR{ ##FNR==NR is true only while the first file (Input_File2) is being read.
split($2,array,"[-(]") ##Split the 2nd field on "-" or "(" so that array[1] and array[2] hold the start and end coordinates.
key=$1 OFS array[1] OFS array[2] ##Build key from the 1st field plus the start and end coordinates.
++count1[key] ##Count how many times this key has been seen so far in Input_File2.
mainarray[key OFS count1[key]]=$NF ##Store the last field (the sequence) under key plus its occurrence number.
next ##Skip the remaining rules for this line.
}
{
key=$1 OFS $2 OFS $3 ##For Input_File1, build key from the 1st, 2nd and 3rd fields.
++count2[key] ##Count how many times this key has been seen so far in Input_File1.
}
((key OFS count2[key]) in mainarray){ ##Check whether this key and occurrence number exist in mainarray.
print $0,mainarray[key OFS count2[key]] ##Print the current line followed by the sequence stored for that occurrence.
}
' Input_File2 Input_File1 ##Input file names; Input_File2 must be listed first.
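The code above pairs rows by their order of occurrence, so it depends on both files listing duplicate coordinates in the same order. A possible alternative, sketched here and not taken from the answer, is to put the sign itself into the key. It assumes the strand is in column 6 of Input_File1 (as in the sample) and that column 2 of Input_File2 always ends in "(+)" or "(-)". The sample rows are deliberately listed in a different order in each file and the sequences are shortened, and seqs is a name chosen for this illustration.

```shell
tmpdir=$(mktemp -d)
cat > "$tmpdir/Input_File1" <<'EOF'
VMNF01000008.1 1441418 1441616 . . + Focub_B2_mimp_2
VMNF01000008.1 1441418 1441616 . . - Focub_II5_mimp_3
EOF
cat > "$tmpdir/Input_File2" <<'EOF'
VMNF01000008.1 1441418-1441616(-) gggagtgtatt
VMNF01000008.1 1441418-1441616(+) tacagtggggg
EOF

merged=$(awk '
FNR==NR {                                   # first file: Input_File2
  split($2, array, "[-(]")                  # array[1] = start, array[2] = end
  strand = substr($2, length($2) - 1, 1)    # "+" or "-" before the closing ")"
  seqs[$1 OFS array[1] OFS array[2] OFS strand] = $NF
  next
}
{
  key = $1 OFS $2 OFS $3 OFS $6             # second file: strand taken from column 6
}
(key in seqs) { print $0, seqs[key] }
' "$tmpdir/Input_File2" "$tmpdir/Input_File1")
rm -rf "$tmpdir"
echo "$merged"
```

With this key each row of Input_File1 receives the sequence of its own strand regardless of row order. If Input_File2 contains the same coordinates and strand more than once with different sequences, the last one read wins, so the occurrence counter from the code above would still be needed in that case.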