Linux: printing the matched search pattern in awk


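I want to print the lines that match a search pattern and then calculate an average over those lines. An example explains it best:

Input file: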

chr17   41275978    41276294    BRCA1_ex02_01   278 
chr17   41275978    41276294    BRCA1_ex02_01   279 
chr17   41275978    41276294    BRCA1_ex02_01   280 
chr17   41275978    41276294    BRCA1_ex02_02   281 
chr17   41275978    41276294    BRCA1_ex02_02   282 
chr17   41275978    41276294    BRCA1_ex02_03   283 
chr17   41275978    41276294    BRCA1_ex02_03   284 
chr17   41275978    41276294    BRCA1_ex02_03   285 
chr17   41275978    41276294    BRCA1_ex02_04   286 
chr17   41275978    41276294    BRCA1_ex02_04   287 
chr17   41275978    41276294    BRCA1_ex02_04   288 
In a bash loop, I want to extract the lines that share the same value in column 4, for example:

Output 1:

chr17   41275978    41276294    BRCA1_ex02_01   278 
chr17   41275978    41276294    BRCA1_ex02_01   279 
chr17   41275978    41276294    BRCA1_ex02_01   280 
Output 2:

chr17   41275978    41276294    BRCA1_ex02_02   281 
chr17   41275978    41276294    BRCA1_ex02_02   282 
Output 3:

chr17   41275978    41276294    BRCA1_ex02_03   283 
chr17   41275978    41276294    BRCA1_ex02_03   284 
chr17   41275978    41276294    BRCA1_ex02_03   285 
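And so on. Calculating the average of column 5 is then quite simple:

awk '{sum+=$5} END {print sum/NR}' _file.txt

In my case there are thousands of BRCA1_exXX_XX lines, so does anyone have an idea how to split them up?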


Paul.

Assuming the entries are sorted by column 4, as they are in the given data, you can do this:

awk '

  $4 != prev {              # if this line's 4th column is different from the previous line
    if (cnt > 0)            # if count of lines is greater than 0
      print prev, sum / cnt #   print the average
    prev = $4               # save previous 4th column
    sum = $5                # initialize sum to column 5
    cnt = 1                 # initialize count to 1
    next                    # go to next line
  }

  {
    sum += $5               # accumulate total of 5th column
    ++cnt                   # increment count of lines
  }

  END {
    if (cnt > 0)             # if count > 0 (avoid divide by 0 on empty file)
      print prev, sum / cnt  #   print the average for the last group
  }

' file
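If the input is not already grouped by column 4, one option is to sort on that column first and then run the same kind of streaming average. The following is only a minimal sketch: the shuffled sample rows and the variable names data and averages are made up for illustration, and the awk program is a condensed variant of the approach above rather than the answer's exact code.

```shell
#!/bin/sh
# Hypothetical sample rows: same columns as in the question, but shuffled.
data='chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 280
chr17 41275978 41276294 BRCA1_ex02_02 282
chr17 41275978 41276294 BRCA1_ex02_01 279'

# Sort on column 4 only, then average column 5 over each run of equal keys.
averages=$(printf '%s\n' "$data" | sort -k4,4 | awk '
  $4 != prev {                       # a new group starts here
    if (cnt > 0) print prev, sum / cnt
    prev = $4; sum = 0; cnt = 0
  }
  { sum += $5; cnt++ }               # runs for every line, including the first of a group
  END { if (cnt > 0) print prev, sum / cnt }
')

printf '%s\n' "$averages"
```

Sorting costs extra time on large files, but it keeps the awk side at constant memory because only one group is held at a time.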

I think this will do what you want:

awk '{
    # Keep running sum of fifth column based on value of fourth column.
    v[$4]+=$5;
    # Keep count of lines with similar fourth column values.
    n[$4]++
}

END {
    # Loop over all the values we saw and print each fourth-column value with the average of its fifth column.
    for (val in n) {
        print val ": " v[val] / n[val]
    }
}' "$file"
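The array approach can probably be extended to answer the follow-up question about a standard deviation column. The sketch below assumes the population standard deviation (dividing by n, not n-1) and uses the sum-of-squares shortcut, which can lose precision when values are large and close together. The sample rows, the sq array, and the output format are illustrative choices, not part of the original answer.

```shell
#!/bin/sh
# Hypothetical sample rows in the same five-column layout as the question.
data='chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_01 279
chr17 41275978 41276294 BRCA1_ex02_01 280
chr17 41275978 41276294 BRCA1_ex02_02 281
chr17 41275978 41276294 BRCA1_ex02_02 282'

stats=$(printf '%s\n' "$data" | awk '
  {
    v[$4]  += $5         # running sum of column 5 per key
    sq[$4] += $5 * $5    # running sum of squares per key
    n[$4]++              # number of lines per key
  }
  END {
    for (val in n) {
      mean = v[val] / n[val]
      var  = sq[val] / n[val] - mean * mean   # population variance (assumption)
      if (var < 0) var = 0                    # guard against rounding error
      printf "%s %.1f %.3f\n", val, mean, sqrt(var)
    }
  }
' | sort)                # for-in order is unspecified, so sort for stable output

printf '%s\n' "$stats"
```

Each output line holds the key, the mean, and the standard deviation. For a sample standard deviation, the variance would instead be divided by n-1, which needs a separate check for groups with a single line.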

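This assumes the entries are always in order.

Wow, it looks like it works :-) Thanks! Could you explain it? And can I add the standard deviation as a third column?

@EtanReisner Yes, it assumes the entries are sorted by the fourth column, as in the given data.

Just add a test on n in the END section to avoid a divide-by-zero error on an empty file. Never use the letter l as a variable name, because it looks far too much like the digit 1. In some fonts the two are completely indistinguishable.

@EdMorton Good point. I used it to stand for "lines", but that doesn't make much sense in this context either. Edited.

Yes, that's great, it works perfectly. Thanks for the explanation!

This will output the data in random order. If you want the output in the same order as the input, just move n[$4]++ out of the current action block, add a new condition+action !n[$4]++{keys[++numKeys]=$4}, and then in the END section loop with for (k=1; k<=numKeys; k++).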
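The last comment's suggestion can be sketched roughly as follows. The keys array and numKeys counter come from that comment; the sample rows, chosen so that first-seen order differs from alphabetical order, and the shell variable names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical sample rows: BRCA1_ex02_03 appears before BRCA1_ex02_01.
data='chr17 41275978 41276294 BRCA1_ex02_03 283
chr17 41275978 41276294 BRCA1_ex02_01 278
chr17 41275978 41276294 BRCA1_ex02_03 285
chr17 41275978 41276294 BRCA1_ex02_01 280'

ordered=$(printf '%s\n' "$data" | awk '
  # The pattern is true only the first time a key is seen; the post-increment
  # still counts every line, so n[] ends up holding the per-key line count.
  !n[$4]++ { keys[++numKeys] = $4 }

  { v[$4] += $5 }        # running sum of column 5 per key

  END {
    # Walk the keys in the order they first appeared in the input.
    for (k = 1; k <= numKeys; k++)
      print keys[k] ": " v[keys[k]] / n[keys[k]]
  }
')

printf '%s\n' "$ordered"
```

Because the END loop iterates over the numerically indexed keys array and not over for (val in n), the output follows input order and does not depend on awk's internal array ordering.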