Bash: how to group by one column and divide the numbers in another column based on the third column

bash awk

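I have a list of genes (column 2) with their mutation frequencies (column 1), each labelled `SYNONYMOUS_CODING` or `NON_SYNONYMOUS_CODING` (column 3). I need to calculate the dN/dS ratio (`NON_SYNONYMOUS_CODING` / `SYNONYMOUS_CODING`) for each gene. Not all genes have both a `SYNONYMOUS_CODING` and a `NON_SYNONYMOUS_CODING` frequency.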

0.00491398 A1BG SYNONYMOUS_CODING
0.923601 A1BG NON_SYNONYMOUS_CODING
0.051361 A1CF NON_SYNONYMOUS_CODING
0.153161 A1CF SYNONYMOUS_CODING
0.0977385 A2M SYNONYMOUS_CODING
1.36114 A2M NON_SYNONYMOUS_CODING
2.19662 A2ML1 SYNONYMOUS_CODING
3.43866 A2ML1 NON_SYNONYMOUS_CODING

The expected result is:

187.95 A1BG
0.3353 A1CF
13.926 A2M
1.565 A2ML1

Here is a small awk script:

cat script.awk

NR%2 { # odd-numbered line: first record of a gene
    readVars(); # store this line's frequency
    next; # move on to the next (even-numbered) line
}
{ # even-numbered line: second record of the same gene
    readVars(); # store this line's frequency
    print (nonSyn/syn), $2; # print the dN/dS ratio and the gene name
    syn = nonSyn = 0; # reset both values for the next gene
}
function readVars() {
    if ($3 ~ "NON_SYNONYMOUS_CODING") # 3rd field is the non-synonymous label
        nonSyn = $1; # store the 1st field as nonSyn
    else syn = $1; # otherwise store it as syn
}

Run:

awk -f script.awk input.txt

Output:

187.954 A1BG
0.33534 A1CF
13.9263 A2M
1.56543 A2ML1

Assumptions:

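Your file is sorted by gene name. If it is not, run
```
sort -k2 genes | awk -f dNdSCompute.awk
```
Not all genes may have both a `SYNONYMOUS_CODING` and a `NON_SYNONYMOUS_CODING` frequency. Such genes are ignored, because the `dN/dS` ratio cannot be computed for them.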

Code:

$ cat dNdSCompute.awk 
{
    #assign the first column value to syn or nonSyn depending on the third column value
    if ($3 == "NON_SYNONYMOUS_CODING")
        nonSyn = $1
    else syn = $1
    #if the current gene is the same as the previous one,
    #print the ratio and reset the frequencies
    if ( $2 == gene){
        print (nonSyn/syn), $2
        syn = nonSyn = 0
    }
    #current gene name is saved in gene variable and will be used at next line
    gene = $2
}

$ awk -f dNdSCompute.awk genes 
187.954 A1BG
0.33534 A1CF
13.9263 A2M
1.56543 A2ML1

For input in which a gene does not have both frequencies, that gene is skipped in the output.

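Both scripts above depend on each gene's two lines being adjacent. A minimal sketch that drops that assumption could store the frequencies per gene in arrays and skip any gene whose synonymous frequency is missing or zero, which also avoids division by zero. The function name `dnds` and the array names are illustrative and do not come from the answers.

```shell
#!/bin/sh
# Hypothetical helper: read "frequency gene type" lines from the given
# files (or from stdin) and print "ratio gene" for each gene that has both
# a non-synonymous and a non-zero synonymous frequency.
dnds() {
    awk '
        # Remember each frequency under its gene name.
        $3 == "NON_SYNONYMOUS_CODING" { nonSyn[$2] = $1 }
        $3 == "SYNONYMOUS_CODING"     { syn[$2] = $1 }
        END {
            for (g in nonSyn)
                # Skip genes with no synonymous value or a zero denominator.
                if ((g in syn) && syn[g] != 0)
                    print nonSyn[g] / syn[g], g
        }
    ' "$@" | sort -k2,2   # for-in order is unspecified, so sort by gene
}
```

Called as `dnds genes`, it prints one `ratio gene` line per gene that has both frequencies, sorted by gene name. The order of the input lines does not matter.
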
Using GNU awk and indirect function calls (the value of `$3` is used as the name of the function to call), the output is:

A1BG 187.954
A1CF 0.33534
A2M 13.9263
A2ML1 1.56543
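
A minimal sketch of the indirect-call idea, not the answerer's actual code: it assumes GNU awk (gawk) 4.0 or later, where `@fn()` calls the function whose name is stored in `fn`. The wrapper name `dnds_indirect`, the `known` lookup table, and the array names are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: requires gawk, because indirect function calls
# are a GNU extension and are not part of POSIX awk.
dnds_indirect() {
    gawk '
        # One function per label that can appear in column 3.
        function SYNONYMOUS_CODING()     { syn[$2] = $1 }
        function NON_SYNONYMOUS_CODING() { nonSyn[$2] = $1 }

        # Calling an undefined function is fatal, so list the valid names.
        BEGIN { known["SYNONYMOUS_CODING"]; known["NON_SYNONYMOUS_CODING"] }

        # Indirect call: run the function named by the third field.
        $3 in known { fn = $3; @fn() }

        END {
            for (g in nonSyn)
                # Skip genes with no synonymous value or a zero denominator.
                if ((g in syn) && syn[g] != 0)
                    print g, nonSyn[g] / syn[g]
        }
    ' "$@" | sort -k1,1   # for-in order is unspecified, so sort by gene
}
```

Called as `dnds_indirect genes`, it prints `gene ratio` lines. Lines whose third field is not one of the two known labels are ignored rather than causing a fatal error.
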

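Welcome to Stack Overflow! This is a forum for programming enthusiasts and professionals. Thanks for showing the required input/output. Please share your best attempt at solving this problem.

What course is this for? I see a lot of these gene data questions.

Please update your input/output to reflect the missing-frequency cases. You will get a division by 0, since not all genes may have both `SYNONYMOUS_CODING` and `NON_SYNONYMOUS_CODING` frequencies.

++ve for good code and explanation, keep it up mate :)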