Python/Awk: compare 3 values (a second-file value that lies between two first-file values) and print multiple columns from both files to a third file


python bash powershell awk sed


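I am trying to compare two large tab-delimited files. I have been trying awk & bash (Ubuntu 15.10), Python (v3.5), and PowerShell (Windows 10). My only background is Java, but my field tends to use scripting languages.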

Here is what I am looking at.

File 1, A[]:

1   gramene gene    4854    9652    .   -   .   ID=gene:GRMZM2G059865;biotype=protein_coding;description=Uncharacterized protein  [Source:UniProtKB/TrEMBL%3BAcc:C0P8I2];gene_id=GRMZM2G059865;logic_name=genebuilder;version=1
1   gramene gene    9882    10387   .   -   .   ID=gene:GRMZM5G888250;biotype=protein_coding;gene_id=GRMZM5G888250;logic_name=genebuilder;version=1
1   gramene gene    109519  111769  .   -   .   ID=gene:GRMZM2G093344;biotype=protein_coding;gene_id=GRMZM2G093344;logic_name=genebuilder;version=1
1   gramene gene    136307  138929  .   +   .   ID=gene:GRMZM2G093399;biotype=protein_coding;gene_id=GRMZM2G093399;logic_name=genebuilder;version=1

File 2, B[]:

S1_6370 T/C 1   6370    +
S1_8210 T   1   8210    +
S1_8376 A   1   8376    +
S1_9889 A   1   9889    +

Desired output:

1   ID=gene:GRMZM2G059865   4857    9652    -   S1_6370 T/C 6370    +   
1   ID=gene:GRMZM2G059865   4857    9652    -   S1_8210 T   8210    +
1   ID=gene:GRMZM2G059865   4857    9652    -   S1_8376 A   8376    +
1   ID=gene:GRMZM5G888250   9882    10387   -   S1_9889 A   9889    +

My general logic:

loop (until end of A[ ] and B[ ])
if
B[$4]>A[$4] && B[$4]<A[$5]  #if the value in B column 4 is in between the values in A columns 4 & 5.
then
-F"\t" print {A[1], A[9(filtered)], A[$4FS$5], B[$1], B[$2], B[$3], B[$4], B[$5]}   #hopefully reflects awk column calls if the two files were able to have their columns defined that way.
moveb++ # to see if the next B column 4 value is in between the values in A columns 4 & 5
else
movea++ # to see if the next set of A columns 4 & 5 values contains the current value of B column 4.
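
One way to read the logic above is as a two-pointer sweep over both files. The following is only a minimal Python sketch of that idea, not a tested solution for the real data. It assumes (none of this is stated in the question) that both files are sorted by position, that the intervals in A do not overlap, and that all rows belong to the same chromosome. The names `sweep_matches`, `format_match`, `A_SAMPLE` and `B_SAMPLE` are made up for illustration, and the attribute column of the sample rows is shortened. It also handles one case the pseudocode does not mention: a B value that lies before the current A interval.

```python
# Sample rows taken from the question (column 9 of A shortened for readability).
A_SAMPLE = [
    "1\tgramene\tgene\t4854\t9652\t.\t-\t.\tID=gene:GRMZM2G059865;biotype=protein_coding".split("\t"),
    "1\tgramene\tgene\t9882\t10387\t.\t-\t.\tID=gene:GRMZM5G888250;biotype=protein_coding".split("\t"),
    "1\tgramene\tgene\t109519\t111769\t.\t-\t.\tID=gene:GRMZM2G093344;biotype=protein_coding".split("\t"),
]
B_SAMPLE = [
    "S1_6370\tT/C\t1\t6370\t+".split("\t"),
    "S1_8210\tT\t1\t8210\t+".split("\t"),
    "S1_8376\tA\t1\t8376\t+".split("\t"),
    "S1_9889\tA\t1\t9889\t+".split("\t"),
]


def sweep_matches(a_rows, b_rows):
    """Yield (a_row, b_row) pairs where A start < B position < A end.

    Assumption: both lists are sorted by position, A intervals do not
    overlap, and every row is on the same chromosome.
    """
    i = j = 0
    while i < len(a_rows) and j < len(b_rows):
        start, end = int(a_rows[i][3]), int(a_rows[i][4])
        pos = int(b_rows[j][3])
        if pos <= start:
            j += 1      # B lies before this interval: try the next B row
        elif pos >= end:
            i += 1      # B lies past this interval: try the next A row
        else:
            yield a_rows[i], b_rows[j]
            j += 1      # match: keep the interval, move on to the next B row


def format_match(a, b):
    """Build one tab-separated output line in the column order of the desired output."""
    gene_id = a[8].split(";", 1)[0]     # keep only the leading "ID=gene:..." part
    return "\t".join([a[0], gene_id, a[3], a[4], a[6], b[0], b[1], b[3], b[4]])


for a_row, b_row in sweep_matches(A_SAMPLE, B_SAMPLE):
    print(format_match(a_row, b_row))
```

Because each pointer only moves forward, every row of either file is visited once, which matters for large inputs. If the sorting assumption does not hold, this sketch will silently miss matches.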

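#references
#references
#the only thing that worked
#can probably filter column 9
#this seems to be the closest, but it does not print into a third file the way I want, and I still cannot fully understand the syntax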
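Try:
$ awk 'BEGIN { x = (getline s < "B"); split(s, b, "\t") }   # read the first line of file B
       !x { exit }                                           # stop once file B is exhausted
       {
           sub(/;.*/, "", $9)                                # keep only the ID=... part of column 9
           while (x && $4 < b[4] && b[4] < $5) {             # B position lies strictly inside the A interval
               print $1, $9, $4, $5, $7, b[1], b[2], b[4], b[5]
               x = (getline s < "B"); split(s, b, "\t")      # advance to the next line of file B
           }
       }' OFS='\t' A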
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_6370 T/C     6370    +
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_8210 T       8210    +
1       ID=gene:GRMZM2G059865   4854    9652    -       S1_8376 A       8376    +
1       ID=gene:GRMZM5G888250   9882    10387   -       S1_9889 A       9889    +

Another awk-based solution:
$ awk -F'\t' 'NR==FNR{
         b0[NR]=$0;
         b4[NR]=$4;
         b_count=NR;
         next;
       }
       {
           for(i=1;i<=b_count;i++)
              if((b4[i]>$4) && (b4[i]<$5)){
                  print $1, gensub(/;.*/,"",1,$9), $4, $5, b0[i]
              }
       }' OFS=$'\t' file_b file_a

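Explanation:
NR==FNR: true only while the first file (file_b) is being read.
While reading it, record the whole file in local arrays: b0 (the full line) and b4 (column 4).
next: skip the processing code that is meant for the second file.
For the next file (file_a), compare and print the lines in the required format.
gensub: a regex substitution function (specific to gawk), used here to format the 9th field of file A. An alternative mechanism such as the split function could be used instead.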
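
For comparison, here is a hedged Python sketch of the same "load file B first, then scan file A" idea. It swaps the inner loop over every B row for a binary search with the standard `bisect` module, which is a different technique from the awk answer. It also differs in two ways that are assumptions rather than anything the answer does: it only matches rows whose chromosome columns agree (column 1 of A, column 3 of B), and it skips blank lines and lines starting with `#`. The function names and the file paths passed to them are hypothetical.

```python
import bisect
from collections import defaultdict


def read_rows(path):
    """Yield the tab-separated fields of each data line in a file."""
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line and not line.startswith("#"):   # assumption: skip blanks and comments
                yield line.split("\t")


def load_positions(b_path):
    """Index file B as {chromosome: (sorted positions, rows in the same order)}."""
    by_chrom = defaultdict(list)
    for row in read_rows(b_path):
        by_chrom[row[2]].append((int(row[3]), row))
    index = {}
    for chrom, items in by_chrom.items():
        items.sort(key=lambda item: item[0])
        index[chrom] = ([pos for pos, _ in items], [row for _, row in items])
    return index


def write_matches(a_path, b_path, out_path):
    """Write one line per B position that lies strictly inside an A interval."""
    index = load_positions(b_path)
    with open(out_path, "w") as out:
        for a in read_rows(a_path):
            positions, rows = index.get(a[0], ([], []))
            start, end = int(a[3]), int(a[4])
            lo = bisect.bisect_right(positions, start)   # first position > start
            hi = bisect.bisect_left(positions, end)      # first position >= end
            gene_id = a[8].split(";", 1)[0]              # keep only "ID=gene:..."
            for b in rows[lo:hi]:
                fields = [a[0], gene_id, a[3], a[4], a[6], b[0], b[1], b[3], b[4]]
                out.write("\t".join(fields) + "\n")
```

A call such as `write_matches("file_a", "file_b", "file_c")` would produce the third file directly. Only file B is held in memory, file A is streamed line by line, and file B does not need to be pre-sorted because the index sorts it.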
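Comments:
Programming by StackOverflow is the worst. You should spend the time learning the syntax of awk, Python, or whatever else, but instead you are wasting time referencing a large number of SO questions, which can hardly help you learn a language if you have not taken the time to learn the basics. Also, you say "I convinced myself..." and you are completely wrong. 1.6 GB? A single Python string can handle that, never mind line-by-line processing.
paste fileB fileA | awk ... might be a good approach...
@anishsane No, there is no one-to-one correspondence between the lines of the two files.
Oh, I misunderstood then...
The number 4857 appears in the desired output but does not seem to appear in the input. Did you mean 4854?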
Output of the second awk solution on the sample files:

1   ID=gene:GRMZM2G059865   4854    9652    S1_6370 T/C 1   6370    +
1   ID=gene:GRMZM2G059865   4854    9652    S1_8210 T   1   8210    +
1   ID=gene:GRMZM2G059865   4854    9652    S1_8376 A   1   8376    +
1   ID=gene:GRMZM5G888250   9882    10387   S1_9889 A   1   9889    +