Using a symbol to correctly match columns in Awk?

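I have two separate files, Input_File1 and Input_File2, each containing a different number of columns, and I merged them based on the data in multiple columns.

So far, a column has been added to Input_File1 to create a new file (file3) by matching the data in columns 1, 2 and 3 of Input_File1 against the data in columns 1, 2 and 3 of Input_File2. Overall, this works well. However, in some cases the data in columns 1, 2 and 3 is identical in Input_File1 and Input_File2, yet the output in file3 should differ. That depends on another feature of Input_File1 and Input_File2, namely the presence of a "-" or "+".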

Input_File1

VMNF01000007.1  6294425 6294650 .   .   +   Focub_B2_mimp_2
VMNF01000008.1  1441418 1441616 .   .   -   Focub_II5_mimp_3
VMNF01000008.1  1441418 1441616 .   .   -   Focub_B2_mimp_1
VMNF01000008.1  1441418 1441616 .   .   +   Focub_B2_mimp_2

Input_File2

VMNF01000007.1  6294425-6294650(+)  tacagtggggggcaataagtatgaataccctttggtgtactgacacacacctctt
VMNF01000008.1  1441418-1441616(-)  gggagtgtattgttttttctgccgctagcccattttaacatttagagtgtgcata
VMNF01000008.1  1441418-1441616(-)  gggagtgtattgttttttctgccgctagcccattttaacatttagagtgtgcata
VMNF01000008.1  1441418-1441616(+)  tacagtggggggcaataagtatgaataccctttgatgtactgacacacacctctt
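
As you can see, the last two lines of Input_File2 are identical except for the (-) and (+), and because of that the sequences that follow them differ.

When file3 is generated, the sequences in column 8 do not differ the way they do in Input_File2. This is because only the following data is considered when matching columns: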
VMNF01000008.1 1441418 1441616
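
To make the collision concrete, here is a minimal sketch. It assumes the lookup keeps one sequence per three-column key, which is a guess about the merge logic rather than something shown in this post. The two input lines are reduced from Input_File2, with the sequences shortened to hypothetical placeholders for readability.

```shell
#!/bin/sh
# Two Input_File2-style lines that differ only in the sign and the sequence
# (sequences shortened; they are placeholders, not the real data).
result=$(printf '%s\n' \
  'VMNF01000008.1  1441418-1441616(-)  gggagtg' \
  'VMNF01000008.1  1441418-1441616(+)  tacagtg' |
awk '
{
  split($2, parts, "[-(]")             # parts[1] = start, parts[2] = end
  key = $1 OFS parts[1] OFS parts[2]   # the sign is not part of the key
  seq[key] = $NF                       # the second line overwrites the first
}
END {
  for (k in seq) print k, seq[k]       # only one entry survives
}')
printf '%s\n' "$result"
```

Only a single entry is left for the key, so every matching line of Input_File1 would receive the same sequence regardless of its sign.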

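Current file3 (note the sequences and the + or - in the last two lines):

file3 should actually look like this (note the sequences and the + or - in the last two lines):

where, as in Input_File2, the sequences differ depending on whether a "-" or a "+" is present.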

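So the operation is essentially the same as in the previous code, except that the "-" or "+" must also match between Input_File1 and Input_File2 to make sure the sequence that follows is the correct one. How can I use the "-" or "+" to determine which sequence the previous code should add in column 8?

This is the code I am using:

Any suggestions? Thanks.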

Please try the following.

awk '
FNR==NR{
  split($2,array,"[-(]")
  key=$1 OFS array[1] OFS array[2]
  ++count1[key]
  mainarray[key OFS count1[key]]=$NF
  next
}
{
  key=$1 OFS $2 OFS $3
  ++count2[key]
}
((key OFS count2[key]) in mainarray){
  print $0,mainarray[key OFS count2[key]]
}
'  Input_file2  Input_file1

The output will be as follows.

VMNF01000007.1  6294425 6294650 .   .   +   Focub_B2_mimp_2 tacagtggggggcaataagtatgaataccctttggtgtactgacacacacctctt
VMNF01000008.1  1441418 1441616 .   .   -   Focub_II5_mimp_3 gggagtgtattgttttttctgccgctagcccattttaacatttagagtgtgcata
VMNF01000008.1  1441418 1441616 .   .   -   Focub_B2_mimp_1 gggagtgtattgttttttctgccgctagcccattttaacatttagagtgtgcata
VMNF01000008.1  1441418 1441616 .   .   +   Focub_B2_mimp_2 tacagtggggggcaataagtatgaataccctttgatgtactgacacacacctctt
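
Explanation: a detailed, commented version of the above. Note that it pairs the nth occurrence of a key in Input_file2 with the nth occurrence of the same key in Input_file1; it does not compare the sign itself, so it relies on both files listing their lines in the same order.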

awk '                                          ##Start of the awk program.
FNR==NR{                                       ##FNR==NR is TRUE only while the first file, Input_file2, is being read.
  split($2,array,"[-(]")                       ##Split the 2nd field on "-" or "(" so array[1] is the start and array[2] is the end.
  key=$1 OFS array[1] OFS array[2]             ##Build key from $1, the start and the end, joined by OFS.
  ++count1[key]                                ##Count how many times this key has been seen in Input_file2.
  mainarray[key OFS count1[key]]=$NF           ##Store the last field (the sequence) under key plus its occurrence number.
  next                                         ##Skip the remaining rules for this line.
}
{
  key=$1 OFS $2 OFS $3                         ##For Input_file1, build key from the first, second and third fields.
  ++count2[key]                                ##Count how many times this key has been seen in Input_file1.
}
((key OFS count2[key]) in mainarray){          ##If the same occurrence of this key was stored from Input_file2,
  print $0,mainarray[key OFS count2[key]]      ##print the current line followed by the stored sequence.
}
'  Input_file2  Input_file1                    ##Input file names; Input_file2 must be given first.
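
If the two files are not guaranteed to be in the same order, one alternative is to put the sign into the key itself. The following is only a sketch under a few assumptions taken from the sample data: the sign is the 6th column of Input_File1, the 2nd field of Input_File2 always ends in "(+)" or "(-)", and there is only one distinct sequence per contig, start, end and sign. The sample files are written to a temporary directory, the sequences are shortened placeholders, and the array names `parts` and `seq` are made up for this example.

```shell
#!/bin/sh
# Build small sample files (reduced from the question; sequences shortened).
tmpdir=$(mktemp -d)
cat > "$tmpdir/Input_File1" <<'EOF'
VMNF01000008.1  1441418 1441616 .   .   +   Focub_B2_mimp_2
VMNF01000008.1  1441418 1441616 .   .   -   Focub_II5_mimp_3
EOF
cat > "$tmpdir/Input_File2" <<'EOF'
VMNF01000008.1  1441418-1441616(-)  gggagtg
VMNF01000008.1  1441418-1441616(+)  tacagtg
EOF

result=$(awk '
FNR==NR{                                   # first file: Input_File2
  split($2, parts, "[-(]")                 # parts[1] = start, parts[2] = end
  sign = substr($2, length($2) - 1, 1)     # character just before the closing ")"
  seq[$1 OFS parts[1] OFS parts[2] OFS sign] = $NF
  next
}
{                                          # second file: Input_File1
  key = $1 OFS $2 OFS $3 OFS $6            # assumes the sign is in column 6
}
(key in seq){
  print $0, seq[key]                       # append the sequence for this sign
}
' "$tmpdir/Input_File2" "$tmpdir/Input_File1")

printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Here Input_File1 deliberately lists the "+" line before the "-" line, the reverse of Input_File2, and each line still gets the sequence belonging to its own sign. If the same contig, start, end and sign can carry different sequences, this key is no longer unique and the occurrence counter from the answer above would have to be combined with it.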