Perl 解析GenBank文件

Perl 解析GenBank文件,perl,awk,biopython,bioperl,Perl,Awk,Biopython,Bioperl,基本上,GenBank文件由基因条目组成(由“gene”声明,后跟相应的“CDS”条目(每个基因只有一个条目)就像我在下面展示的两个一样。我想在一个以制表符分隔的两列文件中获取locu_tag vs product。“gene”和“CDS”前面和后面总是加空格。如果可以使用现有工具轻松执行此任务,请让我知道 输入文件: gene complement(8972..9094) /locus_tag="HAPS_0004"

基本上,GenBank文件由基因条目组成(由“gene”声明,后跟相应的“CDS”条目(每个基因只有一个条目)就像我在下面展示的两个一样。我想在一个以制表符分隔的两列文件中获取locu_tag vs product。“gene”和“CDS”前面和后面总是加空格。如果可以使用现有工具轻松执行此任务,请让我知道

输入文件:

 gene            complement(8972..9094)
                 /locus_tag="HAPS_0004"
                 /db_xref="GeneID:7278619"
 CDS             complement(8972..9094)
                 /locus_tag="HAPS_0004"
                 /codon_start=1
                 /transl_table=11
                 /product="hypothetical protein"
                 /protein_id="YP_002474657.1"
                 /db_xref="GI:219870282"
                 /db_xref="GeneID:7278619"
                 /translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
 gene            9632..11416
                 /gene="frdA"
                 /locus_tag="HAPS_0005"
                 /db_xref="GeneID:7278620"
 CDS             9632..11416
                 /gene="frdA"
                 /locus_tag="HAPS_0005"
                 /note="part of four member fumarate reductase enzyme
                 complex FrdABCD which catalyzes the reduction of fumarate
                 to succinate during anaerobic respiration; FrdAB are the
                 catalytic subcomplex consisting of a flavoprotein subunit
                 and an iron-sulfur subunit, respectively; FrdCD are the
                 membrane components which interact with quinone and are
                 involved in electron transfer; the catalytic subunits are
                 similar to succinate dehydrogenase SdhAB"
                 /codon_start=1
                 /transl_table=11
                 /product="fumarate reductase flavoprotein subunit"
                 /protein_id="YP_002474658.1"
                 /db_xref="GI:219870283"
                 /db_xref="GeneID:7278620"
                 /translation="MQTVNVDVAIVGAGGGGLRAAIAAAEANPNLKIALISKVYPMRS
                 HTVAAEGGAAAVAKEEDSYDKHFHDTVAGGDWLCEQDVVEYFVEHSPVEMTQLERWGC
                 PWSRKADGDVNVRRFGGMKIERTWFAADKTGFHLLHTLFQTSIKYPQIIRFDEHFVVD
                 ILVDDGQVRGCVAMNMMEGTFVQINANAVVIATGGGCRAYRFNTNGGIVTGDGLSMAY
                 RHGVPLRDMEFVQYHPTGLPNTGILMTEGCRGEGGILVNKDGYRYLQDYGLGPETPVG
                 KPENKYMELGPRDKVSQAFWQEWRKGNTLKTAKGVDVVHLDLRHLGEKYLHERLPFIC
                 ELAQAYEGVDPAKAPIPVRPVVHYTMGGIEVDQHAETCIKGLFAVGECASSGLHGANR
                 LGSNSLAELVVFGKVAGEMAAKRAVEATARNQAVIDAQAKDVLERVYALARQEGEESW
                 SQIRNEMGDSMEEGCGIYRTQESMEKTVAKIAELKERYKRIKVKDSSSVFNTDLLYKI
                 ELGYILDVAQSISSSAVERKESRGAHQRLDYVERDDVNYLKHTLAFYNADGTPTIKYS
                 DVKITKSQPAKRVYGAEAEAQEAAAKKE"
所需输出(以制表符分隔的两列文件中的轨迹标记与产品):

事实上,有这样的输出是理想的,每个基因一行(只显示一个基因):

输出

HAPS_0004       hypothetical protein
HAPS_0005       fumarate reductase flavoprotein subunit
尝试
 locus_tag="HAPS_0004" db_xref="GeneID:7278619" complement(8972..9094) codon_start=1 transl_table=11 product="hypothetical protein" protein_id="YP_002474657.1" db_xref="GI:219870282" db_xref="GeneID:7278619" translation="MYYKALAHFLPTLSTMQNILSKSPLSLDFRLLFLAFIDKR"
perl -nE'
  BEGIN{ ($/, $") = ("CDS", "\t") }
  say "@r[0,1]" if @r= m!/(?:locus_tag|product)="(.+?)"!g and @r>1
' file
HAPS_0004       hypothetical protein
HAPS_0005       fumarate reductase flavoprotein subunit