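Awk: how to reformat the fields of a FASTA file header and fold the sequence?

One of the most commonly used file types in bioinformatics is FASTA. FASTA files are simple: each record starts with a "header" line beginning with ">", followed by the "sequence" lines, i.e. everything after the header and before the next record separator (the next ">").

Headers can be very simple (e.g. ">ENSP00000488314.1") or complex. Complex headers carry important but variable information. For the example sequence above (from …), the header record consists of the following parts:

The goal is to be able to omit and/or include fields. Examples:

For simplicity, it is perfectly acceptable to ignore field 09 entirely, but being able to use field 10 would be better.

I would then like to be able to "fold" the sequence to a user-specified width. For example, records whose sequence is folded every 60 characters:

could become (sequence folded every 120 characters):

So far, the best I have been able to do is to call a script containing the following code: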


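The problem with the code above is that the fields $1, $4 and $5 are hard-coded.

  • An elegant solution to a similar problem was given by …, but it requires me to understand gawk extensions and AWK arrays, which I am still working on.

  • Any ideas on how to improve the code above using AWK (rather than Perl/Python) would be much appreciated.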


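  • This shows not only how to do what you want with awk, but also how to structure a shell script properly so that awk is called after the arguments have been parsed (you cannot do that if you invoke awk through a shebang, so don't). It uses GNU awk for gensub() and for the third argument to match().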


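  • Don't use a shebang to invoke the awk interpreter from a shell script; just call awk 'script' "${@:--}". The shebang buys you nothing worthwhile, and it robs you of the chance to separate what the shell is good at from what awk is good at within one shell script.

  • \s and \S mean "any whitespace character" and "any non-whitespace character" respectively, just like the POSIX character classes in the bracket expressions [[:space:]] and [^[:space:]]. Awk arrays are just plain old associative arrays (mapping strings to values), as in any language.
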
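
To make the associative-array point concrete, here is a minimal sketch that stores each space-separated header field under the text before its first colon, using only POSIX awk features. The function name header_tags is hypothetical, and the keys "protein" and "peptide" for the two untagged leading fields are borrowed from the tst.sh script shown further down. The free-text description field is deliberately not handled here.

```shell
#!/usr/bin/env bash
# header_tags: hypothetical helper, not part of the original answer.
# Reads FASTA header lines on stdin and prints the value stored under
# the tag named in $1 (e.g. gene_symbol), one line per header.
header_tags() {
    awk -v want="$1" '
    /^>/ {
        sub(/^>/, "")                 # drop the record marker
        split("", f)                  # portable way to empty an array
        f["protein"] = $1             # first two fields carry no tag,
        f["peptide"] = $2             # so name them by hand
        for (i = 3; i <= NF; i++) {
            colon = index($i, ":")    # position of the first colon
            if (colon > 1)
                f[substr($i, 1, colon - 1)] = $i
        }
        print f[want]                 # look the field up by name
    }'
}

printf '%s\n' '>ENSP00000488314.1 pep gene:ENSG00000276380.2 gene_symbol:UBE2NL' |
    header_tags gene_symbol
```

Keying the array by tag name rather than by position means a field can be requested by name even when a header omits or reorders other fields.
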
    >ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 gene:ENSG00000276380.2 transcript:ENST00000618570.1 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:UBE2NL description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:31710]
    MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
    EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
    PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
    > next record...
    > another one...
    
    Field 01: ENSP00000488314.1                         <=Protein ID
    Field 02: pep                                       <=Peptide record
    Field 03: chromosome:GRCh38:X:143884071:143885255:1 <=Chromosome and chromosomal coordinates
    Field 04: gene:ENSG00000276380.2                    <=Gene ID
    Field 05: transcript:ENST00000618570.1              <=Transcript ID
    Field 06: gene_biotype:polymorphic_pseudogene       <=Gene Biotype
    Field 07: transcript_biotype:polymorphic_pseudogene <=Transcript Biotype
    Field 08: gene_symbol:UBE2NL                        <=Gene Symbol
    Up to here the fields are all nicely separated by spaces, and then...Field 09 (Variable)
    Field 09: description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene)
    Field 10: [Source:HGNC Symbol;Acc:HGNC:31710]       <=Predictable
    
    >ENSP00000488314.1
    
    >ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
    Field: 01          04                     05
    >ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 [Source:HGNC Symbol;Acc:HGNC:31710]
    Field: 01          02  03                                        10
    
    >ENSP00000441696.1 pep chromosome:GRCh38:14:21868839:21869365:1 gene:ENSG00000211788.2 transcript:ENST00000390436.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRAV13-1 description:T cell receptor alpha variable 13-1 [Source:HGNC Symbol;Acc:HGNC:12108]
    MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELG
    KGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
    >ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 gene:ENSG00000276380.2 transcript:ENST00000618570.1 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:UBE2NL description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:31710]
    MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
    EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
    PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
    >ENSP00000437680.2 pep chromosome:GRCh38:22:42140203:42141924:-1 gene:ENSG00000205702.11 transcript:ENST00000435101.1 gene_biotype:polymorphic_pseudogene transcript_biotype:nonsense_mediated_decay gene_symbol:CYP2D7 description:cytochrome P450 family 2 subfamily D member 7 (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:2624]
    DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
    
    >ENSP00000441696.1 gene:ENSG00000211788.2 transcript:ENST00000390436.2
    MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
    >ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
    MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
    PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
    >ENSP00000437680.2 gene:ENSG00000205702.11 transcript:ENST00000435101.1
    DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
    
    awk -v w=60 -f script.awk fasta_file.fa
    
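    #!/usr/bin/env gawk
    ## Script.awk

    # Guard against a missing or zero width, which would loop forever below.
    BEGIN { if (w < 1) w = 60 }

    # Header line: flush any pending sequence, then print the chosen fields.
    /^>/ {
        if (seq != "") print seq
        print $1, $4, $5
        seq = ""
        next
    }

    # Sequence line: accumulate, emitting full lines of width w as they fill.
    {
        seq = seq $1
        while (length(seq) > w) {
            print substr(seq, 1, w)
            seq = substr(seq, 1 + w)
        }
    }

    # Print whatever is left of the last record.
    END { if (seq != "") print seq }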
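
One small step away from the hard-coded $1, $4 and $5, while staying close to this script, is to pass the wanted field numbers in as a variable. The following is only a sketch under assumptions: the wrapper name fold_fasta, the variable cols (space-separated field numbers) and the fallback width of 60 are all hypothetical, and the free-text description field is not treated specially.

```shell
#!/usr/bin/env bash
# fold_fasta: hypothetical wrapper; w and cols are illustrative names.
# Usage: fold_fasta WIDTH "FIELD NUMBERS" [file ...]
fold_fasta() {
    local w=$1 cols=$2
    shift 2
    awk -v w="$w" -v cols="$cols" '
    BEGIN {
        n = split(cols, col, " ")         # col[1..n] = header fields to keep
        if (w < 1) w = 60                 # assumed fallback width
    }
    /^>/ {
        if (seq != "") print seq          # flush the last partial line
        seq = ""
        hdr = $(col[1])
        for (i = 2; i <= n; i++) hdr = hdr " " $(col[i])
        print hdr
        next
    }
    {
        seq = seq $1
        while (length(seq) > w) {         # emit full lines of width w
            print substr(seq, 1, w)
            seq = substr(seq, w + 1)
        }
    }
    END { if (seq != "") print seq }
    ' "$@"
}
```

With this in place, something like fold_fasta 120 "1 4 5" fasta_file.fa would reproduce the original field selection, and changing the quoted list selects other fields without editing the awk code.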
    
    $ cat tst.sh
    #!/usr/bin/env bash
    
    while getopts ":w:f:" opt; do
        case "$opt" in
            w)  wid=${OPTARG}
                ;;
            f)  flds=${OPTARG}
                ;;
            *)  printf 'bad argument "%s"\n' "$OPTARG" >&2
                exit 1
                ;;
        esac
    done
    shift "$((OPTIND-1))"
    
    awk -v wid="$wid" -v flds="$flds" '
    BEGIN {
        wid=(wid ? wid : 120)
        flds=(flds ? flds : "protein gene transcript")
        numTags = split(flds,tags)
    }
    sub(/^>/,"") {
        if (NR > 1) {
            prt()
        }
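        # Split off the free-text description and the trailing [Source:...] tag,
        # if present, so the remaining fields are cleanly space-separated.
        if (match($0,/(description:.*\S)\s+\[([^]]+)/,a)) {
            $0 = substr($0,1,RSTART-1)
            f["description"] = a[1]
            f["predictable"] = a[2]
        }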
        f["protein"] = $1
        f["peptide"] = $2
        for (i=3; i<=NF; i++) {
            tag = gensub(/:.*/,"",1,$i)
            f[tag] = $i
        }
        next
    }
    { f["sequence"] = f["sequence"] $0 }
    END { prt() }
    
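    function prt(   tagNr, tag, seq) {
        printf ">"
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            printf "%s%s", f[tag], (tagNr<numTags ? OFS : ORS)
        }
        # Insert a newline after every wid characters, then drop the trailing
        # one left when the length is an exact multiple of wid.
        seq = gensub(".{"wid"}","&"RS,"g",f["sequence"])
        sub(/\n$/,"",seq)
        print seq
        delete f
    }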
    ' "${@:--}"
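
The "${@:--}" that closes the script expands to the script's arguments, or to a single - when there are none; awk treats - as standard input, so the script works both on named files and in a pipeline. A small sketch makes the expansion visible (show_args is a hypothetical helper used only for illustration):

```shell
#!/usr/bin/env bash
# show_args: hypothetical helper that prints what "${@:--}" expands to,
# one word per line.
show_args() {
    printf '%s\n' "${@:--}"
}

show_args                 # no arguments: prints a single "-"
show_args a.fa 'b c.fa'   # arguments pass through, spaces preserved
```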
    
    $ ./tst.sh file
    >ENSP00000441696.1 gene:ENSG00000211788.2 transcript:ENST00000390436.2
    MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
    >ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
    MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
    PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
    >ENSP00000437680.2 gene:ENSG00000205702.11 transcript:ENST00000435101.1
    DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
    
    $ ./tst.sh -w 60 -f 'gene_symbol chromosome' file
    >gene_symbol:TRAV13-1 chromosome:GRCh38:14:21868839:21869365:1
    MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELG
    KGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
    >gene_symbol:UBE2NL chromosome:GRCh38:X:143884071:143885255:1
    MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
    EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
    PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
    >gene_symbol:CYP2D7 chromosome:GRCh38:22:42140203:42141924:-1
    DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
    
    $ ./tst.sh -w 10000 -f 'description' file
    >description:T cell receptor alpha variable 13-1
    MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
    >description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene)
    MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDDPLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
    >description:cytochrome P450 family 2 subfamily D member 7 (gene/pseudogene)
    DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
    
    $ ./tst.sh -w 10000 -f 'predictable' file
    >Source:HGNC Symbol;Acc:HGNC:12108
    MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
    >Source:HGNC Symbol;Acc:HGNC:31710
    MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDDPLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
    >Source:HGNC Symbol;Acc:HGNC:2624
    DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
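
gensub() is a GNU awk extension, and interval expressions such as .{120} may not be supported by older awks. Where GNU awk cannot be assumed, the folding step can be sketched with a plain substr() loop instead. The names fold_seq and fold below are hypothetical, and the shell wrapper exists only to make the example runnable on its own.

```shell
#!/usr/bin/env bash
# fold_seq: hypothetical helper; folds each input line to WIDTH columns
# using only POSIX awk features (no gensub, no interval expressions).
fold_seq() {
    awk -v wid="$1" '
    function fold(s, w,    out) {
        if (w < 1) return s              # refuse a non-positive width
        out = ""
        while (length(s) > w) {          # peel off one full line at a time
            out = out substr(s, 1, w) "\n"
            s = substr(s, w + 1)
        }
        return out s                     # remainder has at most w characters
    }
    { print fold($0, wid) }'
}

printf '%s\n' 'ABCDEFGHIJ' | fold_seq 4
```

Because the loop only adds a newline while more than w characters remain, a sequence whose length is an exact multiple of the width does not end up with a trailing empty line.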