Trim the first N bases in a multi-FASTA file with awk and print with max-width formatting
Background

The multi-FASTA format holds several sequence records. Each record starts with a single description line, followed by one or more lines of sequence data (RNA, DNA, or protein). The description line begins with a greater-than sign; the sequence identifier follows the ">", and the rest of the line is the description of the record (both are optional). In FASTA files the sequence lines are usually wrapped at a maximum width.

Example records, using max width = 70:

>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
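The header layout described above (identifier right after ">", then an optional description) can be pulled apart with a few lines of awk. This is only a sketch: the function name fasta_headers is made up for illustration, and it assumes the first space on the header line separates the identifier from the description.

```shell
# Hypothetical helper: print "identifier<TAB>description" for every record.
# Reads the files given as arguments, or standard input if there are none.
fasta_headers() {
    awk '/^>/ {
        line = substr($0, 2)          # drop the leading ">"
        id = line                     # default: whole line is the identifier
        desc = ""
        sp = index(line, " ")         # assumed separator: the first space
        if (sp > 0) {
            id = substr(line, 1, sp - 1)
            desc = substr(line, sp + 1)
        }
        printf "%s\t%s\n", id, desc
    }' "$@"
}

# Usage on a toy record:
printf '>gi|1| NADH [Lynx rufus]\nSFVG\n' | fasta_headers
```

Sequence lines are skipped because only lines starting with ">" match the pattern.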
To answer your specific question: you can use the * format modifier to specify the width of the output field:
$ awk 'BEGIN{printf "%s\n", "foo"}'
foo
$ awk 'BEGIN{printf "%*s\n", 10, "foo"}'
       foo
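The same modifier also takes its width from a variable, which is how a width passed on the command line would be used. A small sketch, assuming an awk whose printf supports * (gawk does); the function name pad_demo and the brackets around the field are only there to make the padding visible.

```shell
# Hypothetical demo: pad "foo" to a width supplied as the first argument.
pad_demo() {
    awk -v w="$1" 'BEGIN {
        printf "[%*s]\n", w, "foo"     # right-justified in w columns
        printf "[%-*s]\n", w, "foo"    # "-" flag: left-justified in w columns
    }'
}

pad_demo 6
```

With a width of 6 the two lines come out as "[   foo]" and "[foo   ]".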
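And no, there is no join function to recombine an array into a string (the opposite of split()). But if you let awk split the record into fields, instead of splitting it into an array of elements by hand, you only need to assign a new value to any field and awk will recompile $0 from the fields. I deliberately rely on that side effect in the first solution below, where $1="" removes the first field/line.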
Here is how I would do the job:
$ cat tst.awk
BEGIN { RS=">"; FS="\n"; OFS="" }
NR>1 {
    print RS $1
    $1 = ""
    for ( start=left_trim+1; start<=length(); start+=width ) {
        print substr($0,start,width)
    }
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
Or, if you prefer:
$ cat tst.awk
/^>/ { prtRec(); rec=""; print; next }
{ rec = rec $0 }
END { prtRec() }
function prtRec(   start) {
    for ( start=left_trim+1; start<=length(rec); start+=width ) {
        print substr(rec,start,width)
    }
}
$ awk -v left_trim=15 -v width=70 -f tst.awk file
>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
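To watch the trimming and re-wrapping on an input small enough to check by eye, the buffering approach can be wrapped in a shell function. This is a sketch for experimentation: the name trim_and_wrap and the toy records are made up, and the awk body repeats the second script so the block runs on its own.

```shell
# Hypothetical wrapper: $1 = leading residues to drop, $2 = maximum line width.
# Reads FASTA from standard input.
trim_and_wrap() {
    awk -v left_trim="$1" -v width="$2" '
        /^>/ { prtRec(); rec = ""; print; next }   # header: flush previous record
             { rec = rec $0 }                      # sequence line: append to buffer
        END  { prtRec() }                          # flush the last record
        function prtRec(   start) {
            # start is a local variable; emit the buffer in width-sized chunks
            for (start = left_trim + 1; start <= length(rec); start += width)
                print substr(rec, start, width)
        }
    '
}

# Drop the first 3 residues of each record and wrap at 4 columns.
printf '>seq1 demo\nABCDEFGH\nIJKL\n>seq2\nMNOPQR\n' | trim_and_wrap 3 4
```

For seq1 the joined sequence ABCDEFGHIJKL loses ABC and is printed as DEFG, HIJK, L; for seq2 only PQR remains.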
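$1="" is only better than sub(/[^\n]+\n/,"") because of OFS="". Without that setting, the assignment would introduce an unwanted blank character at the start of the line. The combination of $1="" and OFS="" makes gsub() unnecessary: $1="" tells awk to recompile $0, and OFS="" tells it to simply drop the newlines between the fields (which come from FS="\n") during that recompilation, rather than replacing them with some other character (a blank by default).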
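The effect of OFS on the recompiled record is easy to see on a toy record. A minimal sketch, assuming the same RS and FS settings as in the first script; the function name join_demo is made up, and the brackets only mark where the record starts and ends.

```shell
# Hypothetical demo: blank the header field and show the rebuilt $0
# for a given OFS value passed as the first argument.
join_demo() {
    awk -v ofs="$1" 'BEGIN { RS = ">"; FS = "\n"; OFS = ofs }
        NR > 1 { $1 = ""; print "[" $0 "]" }'
}

printf '>header\nABC\nDEF\n' | join_demo " "   # default-style OFS: blanks appear
printf '>header\nABC\nDEF\n' | join_demo ""    # empty OFS: lines are simply joined
```

The first call prints "[ ABC DEF ]" (the blanks come from OFS, including one for the empty field after the final newline), the second prints "[ABCDEF]".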
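For reference, the kind of manual loop that the field-based approach replaces, joining the elements of array a from index 2 onward into one string:
sequence = ""
for (i = 2; i <= length(a); i++) {
    sequence = sequence a[i]
}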