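How can I use Awk to reformat the fields of a FASTA header and fold the sequence?

FASTA files are among the most commonly used files in bioinformatics. They are simple: each record starts with a "header" line beginning with ">", followed by the "sequence" lines, that is, everything after the header and before the next record separator (the next ">"). A header can be very simple (e.g. ">ENSP0000488314.1") or complex, and complex headers carry important but variable information.

For the example record ENSP00000488314.1, the header consists of the parts listed as Field 01 to Field 10. I would like to be able to omit and/or include fields, for example keeping only field 01, keeping fields 01, 04 and 05, or keeping fields 01, 02, 03 and 10. For simplicity it is perfectly acceptable to ignore field 09 entirely, but being able to use field 10 would be better.

I would then like to "fold" the sequence to a user-specified width. For example, records whose sequence is folded every 60 characters might become records whose sequence is folded every 120 characters.

The best I have managed so far is to call a script (script.awk) with awk -v w=60 -f script.awk fasta_file.fa.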
The problem with my script is that the fields $1, $4 and $5 are hard-coded. An elegant solution to a similar problem has been posted, but it requires me to understand gawk extensions and AWK arrays, which I am still working towards. Any ideas on how to improve the script using AWK (not Perl/Python) would be greatly appreciated.
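The script tst.sh shows not only how to do what you want with awk but also how to structure a shell script properly, so that awk is called after the arguments have been parsed (you cannot do that if you invoke awk through a shebang, so don't). It uses GNU awk for gensub() and for the third argument to match().

Don't use a shebang in a shell script to invoke the awk interpreter; just call awk 'script' "${@:--}" instead. The shebang buys you nothing worth having, and it robs you of the chance to separate what the shell is good at from what awk is good at within one shell script.

\s and \S (both GNU awk extensions) mean "any whitespace character" and "any non-whitespace character" respectively, just like the POSIX character classes in the bracket expressions [[:space:]] and [^[:space:]]. awk arrays are just plain old associative arrays (mapping strings to values), as in any language.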
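To make the "${@:--}" idiom concrete, here is a minimal sketch. The function name count_headers is made up for illustration and is not part of the answer's script. The expansion yields the file arguments when there are any and a single - otherwise, which awk treats as standard input.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustrative name): count the FASTA records in the
# given files, or on standard input when no file is given.
count_headers() {
    # "${@:--}" expands to the arguments, or to "-" when there are none.
    awk '/^>/ { n++ } END { print n + 0 }' "${@:--}"
}

# No file argument, so awk reads the pipe; this prints 2.
printf '>a\nAC\n>b\nGT\n' | count_headers
```

The same wrapper therefore works both as `count_headers file.fa` and at the end of a pipeline, without any extra option handling.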
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 gene:ENSG00000276380.2 transcript:ENST00000618570.1 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:UBE2NL description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:31710]
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
> next record...
> another one...
Field 01: ENSP00000488314.1 <=Protein ID
Field 02: pep <=Peptide record
Field 03: chromosome:GRCh38:X:143884071:143885255:1 <=Chromosome and chromosomal coordinates
Field 04: gene:ENSG00000276380.2 <=Gene ID
Field 05: transcript:ENST00000618570.1 <=Transcript ID
Field 06: gene_biotype:polymorphic_pseudogene <=Gene Biotype
Field 07: transcript_biotype:polymorphic_pseudogene <=Transcript Biotype
Field 08: gene_symbol:UBE2NL <=Gene Symbol
Up to here the fields are all nicely separated by spaces, and then...Field 09 (Variable)
Field 09: description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene)
Field 10: [Source:HGNC Symbol;Acc:HGNC:31710] <=Predictable
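The field breakdown above maps naturally onto an awk associative array keyed by tag name. The following is a minimal sketch in POSIX awk, not the answer's code: the function name header_to_map and the keys "protein", "type" and "source" are illustrative choices, and it assumes the free-text tail always starts with " description:" and that field 10 starts with " [Source:".

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustrative names): print selected header fields as
# "key => value" lines, one block per header read from standard input.
header_to_map() {
    awk '
    sub(/^>/, "") {
        split("", f)                          # portable way to empty the array
        pos = index($0, " description:")      # start of the free-text tail, if present
        if (pos > 0) {
            tail = substr($0, pos + 1)
            $0   = substr($0, 1, pos - 1)     # keep fields 01 to 08; awk re-splits them
            src  = index(tail, " [Source:")   # assumed start of field 10
            if (src > 0) {
                f["source"] = substr(tail, src + 1)
                tail = substr(tail, 1, src - 1)
            }
            f["description"] = tail
        }
        f["protein"] = $1
        if (NF >= 2) f["type"] = $2
        for (i = 3; i <= NF; i++)             # remaining fields look like tag:value
            f[substr($i, 1, index($i, ":") - 1)] = $i
        n = split("protein type gene gene_symbol description source", order, " ")
        for (k = 1; k <= n; k++)
            if (order[k] in f) print order[k] " => " f[order[k]]
    }'
}

printf '%s\n' '>P1 pep gene:G1 description:some enzyme [Source:X;Acc:1]' | header_to_map
```

Cutting the description and source off first matters because both contain spaces, so they cannot be recovered by whitespace splitting once the line has been split into fields.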
>ENSP00000488314.1
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
Field: 01 04 05
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 [Source:HGNC Symbol;Acc:HGNC:31710]
Field: 01 02 03 10
>ENSP00000441696.1 pep chromosome:GRCh38:14:21868839:21869365:1 gene:ENSG00000211788.2 transcript:ENST00000390436.2 gene_biotype:TR_V_gene transcript_biotype:TR_V_gene gene_symbol:TRAV13-1 description:T cell receptor alpha variable 13-1 [Source:HGNC Symbol;Acc:HGNC:12108]
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELG
KGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 pep chromosome:GRCh38:X:143884071:143885255:1 gene:ENSG00000276380.2 transcript:ENST00000618570.1 gene_biotype:polymorphic_pseudogene transcript_biotype:polymorphic_pseudogene gene_symbol:UBE2NL description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:31710]
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 pep chromosome:GRCh38:22:42140203:42141924:-1 gene:ENSG00000205702.11 transcript:ENST00000435101.1 gene_biotype:polymorphic_pseudogene transcript_biotype:nonsense_mediated_decay gene_symbol:CYP2D7 description:cytochrome P450 family 2 subfamily D member 7 (gene/pseudogene) [Source:HGNC Symbol;Acc:HGNC:2624]
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
>ENSP00000441696.1 gene:ENSG00000211788.2 transcript:ENST00000390436.2
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 gene:ENSG00000205702.11 transcript:ENST00000435101.1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
awk -v w=60 -f script.awk fasta_file.fa
#!/usr/bin/env gawk
## Script.awk
/^>/ {
    if (seq != "") print seq
    print $1, $4, $5
    seq = ""
    next
}
{
    seq = seq $1
    while (length(seq) > w) {
        print substr(seq, 1, w)
        seq = substr(seq, 1 + w)
    }
}
END { if (seq != "") print seq }
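One small step away from the hard-coded $1, $4 and $5 is to pass the wanted field numbers in as a variable. This is only a sketch under assumptions: the wrapper name reformat_fasta and the variable cols are invented here, it uses POSIX awk only, and it addresses just the space-separated fields 01 to 08 (fields 09 and 10 contain spaces, so they cannot be selected by number this way).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (illustrative names).
# usage: reformat_fasta WIDTH "FIELD NUMBERS" [FILE...]
reformat_fasta() {
    local w=$1 cols=$2
    shift 2
    awk -v w="$w" -v cols="$cols" '
    function emit_seq() {                 # print the pending sequence, w characters per line
        while (length(seq) > w) {
            print substr(seq, 1, w)
            seq = substr(seq, w + 1)
        }
        if (seq != "") print seq
        seq = ""
    }
    BEGIN {
        if (w + 0 < 1) w = 60             # guard against a missing or zero width
        n = split(cols, col, " ")
    }
    /^>/ {
        emit_seq()
        sub(/^>/, "")                     # drop ">" so that $1 is the bare ID
        out = ""
        for (i = 1; i <= n; i++)
            out = out (i > 1 ? " " : "") $(col[i])
        print ">" out
        next
    }
    { seq = seq $1 }
    END { emit_seq() }
    ' "${@:--}"
}

printf '>P1 pep chr:1 gene:G1 transcript:T1\nABCDEF\nGHIJ\n' | reformat_fasta 4 "1 4 5"
```

Unlike the original loop, this version collects the whole sequence and folds it only when the next header or the end of input is reached, which keeps the folding logic in one place.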
$ cat tst.sh
#!/usr/bin/env bash

# Parse -w WIDTH and -f "FIELD NAMES" before handing the remaining arguments to awk.
while getopts ":w:f:" opt; do
    case "$opt" in
        w)  wid=${OPTARG}
            ;;
        f)  flds=${OPTARG}
            ;;
        *)  printf 'bad argument "%s"\n' "$opt" >&2
            exit 1
            ;;
    esac
done
shift "$((OPTIND-1))"

awk -v wid="$wid" -v flds="$flds" '
    BEGIN {
        # Defaults: fold at 120 characters; print protein, gene and transcript.
        wid = (wid ? wid : 120)
        flds = (flds ? flds : "protein gene transcript")
        numTags = split(flds,tags)
    }
    sub(/^>/,"") {
        # Header line: print the previous record, then index this header by tag.
        if (NR > 1) {
            prt()
        }
        # Split off the free-text description and the bracketed source (GNU awk 3-arg match).
        match($0,/(description:.*\S)\s+\[([^]]+)/,a)
        $0 = substr($0,1,RSTART-1)
        f["description"] = a[1]
        f["predictable"] = a[2]
        f["protein"] = $1
        f["peptide"] = $2
        for (i=3; i<=NF; i++) {
            tag = gensub(/:.*/,"",1,$i)
            f[tag] = $i
        }
        next
    }
    { f["sequence"] = f["sequence"] $0 }
    END { prt() }
    function prt(   tagNr, tag) {
        printf ">"
        for (tagNr=1; tagNr<=numTags; tagNr++) {
            tag = tags[tagNr]
            printf "%s%s", f[tag], (tagNr<numTags ? OFS : ORS)
        }
        # Insert a newline after every wid characters of the sequence.
        print gensub(".{"wid"}","&"RS,"g",f["sequence"])
        delete f
    }
' "${@:--}"
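For readers without GNU awk, the folding step can also be written with a plain loop. The sketch below is an assumption-laden alternative and not the answer's code: fold_seq and fold are invented names, and it expects one complete sequence per input line. It also sidesteps an edge case of the gensub() call, which, as far as I can tell, emits an extra empty line when the sequence length is an exact multiple of the width, because a record separator is appended after the last full chunk as well.

```shell
#!/usr/bin/env bash
# Hypothetical helper (illustrative names).
# usage: fold_seq [WIDTH]   reads one complete sequence per line on stdin
fold_seq() {
    awk -v wid="$1" '
    function fold(s, w,    out) {
        # Move w characters at a time into out, each followed by a newline;
        # the final chunk (w characters or fewer) gets no newline of its own.
        while (length(s) > w) {
            out = out substr(s, 1, w) "\n"
            s = substr(s, w + 1)
        }
        return out s
    }
    BEGIN { wid = (wid + 0 > 0 ? wid + 0 : 120) }   # default width 120, as in tst.sh
    { print fold($0, wid) }'
}

printf 'ABCDEFGHIJ\n' | fold_seq 4
```

Inside tst.sh the same fold() function could replace the gensub() call in prt(), which would leave only the three-argument match() as a GNU awk dependency.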
$ ./tst.sh file
>ENSP00000441696.1 gene:ENSG00000211788.2 transcript:ENST00000390436.2
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>ENSP00000488314.1 gene:ENSG00000276380.2 transcript:ENST00000618570.1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>ENSP00000437680.2 gene:ENSG00000205702.11 transcript:ENST00000435101.1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
$ ./tst.sh -w 60 -f 'gene_symbol chromosome' file
>gene_symbol:TRAV13-1 chromosome:GRCh38:14:21868839:21869365:1
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELG
KGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>gene_symbol:UBE2NL chromosome:GRCh38:X:143884071:143885255:1
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLA
EEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDD
PLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>gene_symbol:CYP2D7 chromosome:GRCh38:22:42140203:42141924:-1
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
$ ./tst.sh -w 10000 -f 'description' file
>description:T cell receptor alpha variable 13-1
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>description:ubiquitin conjugating enzyme E2 N like (gene/pseudogene)
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDDPLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>description:cytochrome P450 family 2 subfamily D member 7 (gene/pseudogene)
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT
$ ./tst.sh -w 10000 -f 'predictable' file
>Source:HGNC Symbol;Acc:HGNC:12108
MTSIRAVFIFLWLQLDLVNGENVEQHPSTLSVQEGDSAVIKCTYSDSASNYFPWYKQELGKGPQLIIDIRSNVGEKKDQRIAVTLNKTAKHFSLHITETQPEDSAVYFCAAS
>Source:HGNC Symbol;Acc:HGNC:31710
MAELPHRIIKETQRLLAEPVPGIKAEPDESNARYFHVVIAGESKDSPFEGGTFKRELLLAEEYPMAAPKVRFMTKIYHPNVDKLERIS*DILKDKWSPALQIRTVLLSIQALLNAPNPDDPLANDVVEQWKTNEAQAIETARAWTRLYAMNSI
>Source:HGNC Symbol;Acc:HGNC:2624
DPAQPPRDLTEAFLAKKEKAKGSPESSFNDENLRIVSVSNRRSTT