Using awk to trim the first N bases in a multi-FASTA file and print with max-width formatting


Background

>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN

The multi-FASTA format contains multiple sequence records. Each record starts with a single description line, followed by one or more lines of sequence data (RNA, DNA, or protein). The description line begins with a greater-than symbol; the text immediately after ">" is the sequence identifier, and the rest of the line is the description of the record (both are optional).
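To make the record structure concrete, here is a minimal sketch that reads a tiny made-up multi-FASTA stream and prints each identifier with its total sequence length. The two records (`seq1`, `seq2`) and the shell variable `summary` are invented for illustration and are not part of the question's data.

```shell
# Hypothetical two-record input; any POSIX awk should do.
summary=$(printf '>seq1 first demo record\nACGT\nAC\n>seq2\nGGG\n' | awk '
    # Header line: flush the previous record, then start a new one.
    /^>/ { if (id != "") print id, len; id = substr($1, 2); len = 0; next }
    # Sequence line: accumulate its length.
         { len += length($0) }
    END  { if (id != "") print id, len }
')
echo "$summary"
```

The identifier is taken as the first whitespace-delimited word after ">", which matches the description above but is a simplification of real-world headers.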

In FASTA files, sequence lines are usually wrapped to a maximum width.

Example input, with max width = 70:


To answer your specific questions: you can use the * format modifier to specify the width of an output field:

$ awk 'BEGIN{printf "%s\n", "foo"}'
foo
$ awk 'BEGIN{printf "%*s\n", 10, "foo"}'
       foo
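No, there is no built-in join function that reassembles an array into a string (the opposite of split()). But if you let awk split the record into fields instead of manually splitting it into an array of elements, then you only have to assign a new value to any field and awk will recompile those fields into $0. That is the side effect I deliberately rely on in my first solution below, where $1="" removes the first field/line.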
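As a small illustration of that recompilation, the sketch below assigns a field to itself so that awk rebuilds $0 using OFS. The input string, the "-" separator, and the shell variable `joined` are arbitrary choices for the demo, not something from the original answer.

```shell
# Assigning to any field (here $1 = $1) makes awk rebuild $0,
# joining the fields with OFS instead of the original separators.
joined=$(printf 'a b c\n' | awk 'BEGIN { OFS = "-" } { $1 = $1; print }')
echo "$joined"
```

This is effectively a join over the fields of the record, which is why no explicit loop is needed once awk has done the splitting.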

Here is how I would do the task:

$ cat tst.awk
BEGIN { RS=">"; FS="\n"; OFS="" }
NR>1 {
    print RS $1
    $1 = ""
    for ( start=left_trim+1; start<=length(); start+=width ) {
        print substr($0,start,width)
    }
}

$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
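The trimming and wrapping loop can be tried in isolation on a short string. This is only a sketch: the 16-character string, `left_trim=3`, `width=4`, and the shell variable `chunks` are made-up values chosen so the result is easy to check by eye.

```shell
# Skip the first left_trim characters, then print the rest in
# pieces of at most `width` characters, as the script's loop does.
chunks=$(awk -v left_trim=3 -v width=4 'BEGIN {
    s = "ABCDEFGHIJKLMNOP"
    for (start = left_trim + 1; start <= length(s); start += width)
        print substr(s, start, width)
}')
echo "$chunks"
```

The last piece is shorter than `width` whenever the remaining length is not a multiple of it, because substr() simply stops at the end of the string.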
Or, if you prefer:

$ cat tst.awk
/^>/ { prtRec(); rec=""; print; next }
{ rec = rec $0 }
END { prtRec() }
function prtRec(        start) {
    for ( start=left_trim+1; start<=length(rec); start+=width ) {
        print substr(rec,start,width)
    }
}

$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
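$1="" is only better than sub(/[^\n]+\n/,"") because of OFS="". Without it, the assignment would introduce an unwanted blank character at the start of the line.

The combination of $1="" and OFS="" makes gsub() unnecessary: $1="" tells awk to recompile $0, and OFS="" tells it to simply drop the newlines between fields (which come from FS="\n") during that recompilation, rather than replace them with some other character (a blank by default).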
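The effect of OFS on the recompiled record can be checked with a toy record. This is a sketch under assumptions: the record `>hdr`, `AAA`, `BBB` and the shell variables `with_default` and `with_empty` are invented for the demo, and the input deliberately has no trailing newline so the record has exactly three fields.

```shell
# Default OFS (a single blank): the emptied first field still
# contributes a separator, so the result starts with a blank.
with_default=$(printf '>hdr\nAAA\nBBB' |
    awk 'BEGIN { RS = ">"; FS = "\n" } NR > 1 { $1 = ""; print }')

# OFS set to the empty string: the fields are simply concatenated.
with_empty=$(printf '>hdr\nAAA\nBBB' |
    awk 'BEGIN { RS = ">"; FS = "\n"; OFS = "" } NR > 1 { $1 = ""; print }')

printf '[%s]\n[%s]\n' "$with_default" "$with_empty"
```

The brackets in the final printf only make the leading blank visible in the first result.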
For reference, the manual join loop that the field reassignment replaces:

sequence = ""
for (i = 2; i <= length(a); i++) {
    sequence = sequence a[i]
}