awk not printing the correct value for one condition when using multiple conditions

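Thanks to @Jose Ricardo Bustos M., whose help led to the awk further down. Run on file1 and file2, it produces the output listed right after the command.

However, I cannot seem to capture BRCA2 from file1 using the BRCA 1, BRCA2 entry (line 2 of file2, skipping the header). I am not sure whether that is because BRCA2 is the second item, after the comma, or whether the problem is in $7: the field contains full gene sequence and full deletion/duplication analysis, so full gene sequence is only a partial match of the whole field. Thanks :)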

file1

BRCA2
BCR
SCN1A
fbn1
file2

Tier    explanation .   List code   gene    gene name   methodology disease
Tier 1  .   .   811 DMD dystrophin  deletion analysis and duplication analysis, if performed Publication Date: January 1, 2014  Duchenne/Becker muscular dystrophy
Tier 1  .   Jan-16  81  BRCA 1, BRCA2   breast cancer 1 and 2   full gene sequence and full deletion/duplication analysis   hereditary breast and ovarian cancer
Tier 1  .   Jan-16  70  ABL1    ABL1    gene analysis variants in the kinse domane  acquired imatinib tyrosine kinase inhibitor
Tier 1  .   .   806 BCR/ABL 1   t(9;22) major breakpoint, qualitative or quantitative   chronic myelogenous leukemia CML
Tier 1  .   Jan-16  85  FBN1    Fibrillin   full gene sequencing    heart disease
Tier 1  .   Jan-16  95  FBN1    fibrillin   del/dup heart disease
awk

awk 'BEGIN{FS=OFS="\t"}  # tab-separated input and output
{$0=toupper($0)}  # uppercase the whole line, so fbn1 in file1 becomes FBN1
{$5=toupper($5)} # uppercase $5 (redundant after the line above)
{$7=toupper($7)} # uppercase $7 (redundant after the line above)
FNR==NR{ # first file on the command line (file2): build the lookup array
if(NR>1 && ($7 ~ /FULL GENE SEQUENC/)) {  # skip header; the regexp matches both "full gene sequence" and "full gene sequencing"
      gsub(" ","",$5)       #removing white space
      n=split($5,v,"/")
      d[v[1]] = $4          #from split, first element as key
  }
  next
}{print $1, ($1 in d?d[$1]:279)}' file2 file1 # print name then default if no match

BRCA2    279
BCR    279
SCN1A    279
FBN1    85
Desired output

BRCA2    81  --- match in line 2 of $5 in file 2, BRCA 1, BRCA2 and $7 has full gene sequence
BCR    279
SCN1A    279
FBN1    85

The problem is in the following part of the code:

gsub(" ","",$5)
n=split($5,v,"/")
d[v[1]] = $4
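Well, it handles the BCR/ABL 1 case correctly, but when you apply it to BRCA 1, BRCA2 it does not produce the result you expect. Removing the spaces from BRCA 1, BRCA2 gives BRCA1,BRCA2, and splitting that on / returns the same string, BRCA1,BRCA2, unchanged, because / is the wrong delimiter for this field. The array key therefore never equals BRCA2.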
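The failure can be reproduced in isolation. The sketch below applies the same gsub and split calls to the two gene fields from file2; show_keys is a hypothetical helper name introduced only for this illustration.

```shell
# Hypothetical helper: apply the question's key-building step to one gene field
# and print every piece that split() returns, separated by " | ".
show_keys() {
  printf '%s\n' "$1" | awk '{
    gsub(" ", "", $0)        # same whitespace removal as in the question
    n = split($0, v, "/")    # split on "/" only, as in the question
    for (i = 1; i <= n; i++)
      printf "%s%s", v[i], (i < n ? " | " : "\n")
  }'
}

show_keys "BCR/ABL 1"        # prints: BCR | ABL1
show_keys "BRCA 1, BRCA2"    # prints: BRCA1,BRCA2
```

The second call returns a single piece, so the only key stored for that row is BRCA1,BRCA2.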

So you need to split the string again on , and store each piece as a key. Something like:

n=split($5,v,",")
for (i=1; i <= n; i++) {
  d[v[i]] = $4
}
Putting it all together:

gsub(" ","",$5)
n=split($5,v,"[/,]")   # split on either / or ,
for (i=1; i <= n; i++) {
  d[v[i]] = $4
}
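For reference, here is a minimal end-to-end sketch of the fix applied to the original command. The sample rows are a trimmed copy of the question's data; the tab positions (in particular in the BCR/ABL 1 row) and the scratch directory are assumptions, since the page shows the columns collapsed to spaces. The separator is written as the bracket expression [/,] so that both / and , split the field.

```shell
#!/bin/sh
# Scratch directory for the sample files (hypothetical location).
tmp=$(mktemp -d)

printf '%s\n' BRCA2 BCR SCN1A fbn1 > "$tmp/file1"

# Trimmed file2; the tab placement is an assumption.
{
  printf 'Tier\texplanation\t.\tList code\tgene\tgene name\tmethodology\tdisease\n'
  printf 'Tier 1\t.\tJan-16\t81\tBRCA 1, BRCA2\tbreast cancer 1 and 2\tfull gene sequence and full deletion/duplication analysis\thereditary breast and ovarian cancer\n'
  printf 'Tier 1\t.\t.\t806\tBCR/ABL 1\tt(9;22)\tmajor breakpoint, qualitative or quantitative\tchronic myelogenous leukemia CML\n'
  printf 'Tier 1\t.\tJan-16\t85\tFBN1\tFibrillin\tfull gene sequencing\theart disease\n'
  printf 'Tier 1\t.\tJan-16\t95\tFBN1\tfibrillin\tdel/dup\theart disease\n'
} > "$tmp/file2"

result=$(awk 'BEGIN { FS = OFS = "\t" }
{ $0 = toupper($0) }                        # uppercase every line of both files
FNR == NR {                                 # first file (file2): build the lookup
  if (FNR > 1 && $7 ~ /FULL GENE SEQUENC/) {
    gsub(" ", "", $5)                       # "BRCA 1, BRCA2" becomes "BRCA1,BRCA2"
    n = split($5, v, "[/,]")                # split on "/" or ","
    for (i = 1; i <= n; i++) d[v[i]] = $4   # every gene name becomes a key
  }
  next
}
{ print $1, ($1 in d ? d[$1] : 279) }       # second file (file1): look up or default
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

With this sample data the script prints BRCA2 with 81, BCR and SCN1A with the default 279, and FBN1 with 85, which is the desired output from the question.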
A partial match is fine for $7 ~ /.../, but for $1 in d it has to be an exact match. Debug the code: print the intermediate variables and check what is not working.

Thank you very much for the help and the explanation, I really appreciate it :).
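The difference between the two tests, and the kind of intermediate printing suggested above, can be sketched as follows. The array contents are hard-coded here to the single key that the original code stores for the BRCA row; the variable names are illustrative only.

```shell
out=$(awk 'BEGIN {
  method = "FULL GENE SEQUENCE AND FULL DELETION/DUPLICATION ANALYSIS"
  # "~" is an unanchored regexp match, so a substring of the field is enough
  print "regexp match:", (method ~ /FULL GENE SEQUENC/ ? "yes" : "no")

  d["BRCA1,BRCA2"] = 81                  # key stored when only "/" is used to split
  # debugging aid: dump every key with brackets so stray characters are visible
  for (k in d) print "key=[" k "] value=" d[k]

  # "in" tests for an exact array key, never for a substring of a key
  print "BRCA2 in d:", (("BRCA2" in d) ? "yes" : "no")
}')
printf '%s\n' "$out"
```

The dump makes it obvious that the stored key is the whole comma-separated string, which is why the exact lookup for BRCA2 fails even though the regexp test on $7 succeeds.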