Parsing a GTF gene file in Python
I have a gene GTF file that I am trying to parse so that gene_id, gene_type, gene_status, gene_name and level each end up in a separate column. My original file looks like this:
chr1 | ENSEMBL gene| 17369| 17436| . - . |gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-1"; level 3;
chr1 | ENSEMBL gene| 30366| 30503| . + . |gene_id "ENSG00000274890.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR1302-2"; level 3;
chr1 | ENSEMBL gene| 157784| 157887| . - . |gene_id "ENSG00000222623.1"; gene_type "snRNA"; gene_status "KNOWN"; gene_name "RNU6-1100P"; level 3;
chr1 | ENSEMBL gene| 187891| 187958| . - . |gene_id "ENSG00000273874.1"; gene_type "miRNA"; gene_status "KNOWN"; gene_name "MIR6859-2"; level 3;
I would like it to look like this, with gene_id, gene_type, gene_status, gene_name and level each in a separate column:
chr1 |ENSEMBL |gene| 17369| 17436 |. - . |gene_id "ENSG00000278267.1" |gene_type "miRNA" |gene_status "KNOWN" |gene_name "MIR6859-1" |level 3
chr1 |ENSEMBL |gene| 30366| 30503 |. + . |gene_id "ENSG00000274890.1" |gene_type "miRNA" |gene_status "KNOWN" |gene_name "MIR1302-2" |level 3
chr1 |ENSEMBL |gene| 157784| 157887 |. - . |gene_id "ENSG00000222623.1" |gene_type "snRNA" |gene_status "KNOWN" |gene_name "RNU6-1100P" |level 3
chr1 |ENSEMBL |gene| 187891| 187958 |. - . |gene_id "ENSG00000273874.1" |gene_type "miRNA" |gene_status "KNOWN" |gene_name "MIR6859-2" |level 3
I have tried parsing it with gffutils, using the basic code they provide:
import gffutils

db = gffutils.create_db("sRNA.gene.gtf", dbfn='sRNA.gene.gtf.db')
print(list(db.featuretypes()))

# Here's how to write genes out to file
with open('sRNA.gene.gtf', 'w') as fout:
    for gene in db.features_of_type('gene'):
        fout.write(str(gene) + '\n')
However, I get an "ImportError: cannot import name 'feature'" error:
ImportError Traceback (most recent call last)
<ipython-input-26-4dd7cd5c7e24> in <module>()
2
3
----> 4 db = gffutils.create_db("sRNA.gene.gtf", dbfn='sRNA.gene.gtf.db')
5
6 #db = gffutils.FeatureDB('sRNA.gene.gtf.db')
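I'm not sure what is going wrong here, and I'm now considering parsing the file from the command line instead. Can anyone offer advice on the best way to parse a GTF file?
Thanks in advance.

You want to change the multiple delimiters in the GTF file into a single tab delimiter. Once you have done that, the file is no longer a GTF file. The following code converts the contents of a GTF file into a text file: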
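import gffutils

try:
    # create_db() raises an error if sample.db already exists (see the
    # comments), so that case is ignored and the existing file is reused
    db = gffutils.create_db("sample.gtf", dbfn='sample.db')
except Exception:
    pass

db = gffutils.FeatureDB('sample.db', keep_order=True)
with open('sample.txt', 'w') as fout:
    for line in db.all_features():
        line = str(line)
        line = line.split(";")  # make your parsing changes here
        fout.write(str(line) + '\n')  # line is a list here, so its repr is written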
Note that the create_db() method can only be used once for a given database file. That is why I had commented it out.
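The "make your parsing changes here" step is left open above. As a minimal sketch of one way to fill it in without gffutils, the snippet below splits the ninth GTF field into key/value pairs and writes the requested attributes as extra tab-separated columns. The helper names `parse_attributes` and `gtf_to_table` are hypothetical, and the sketch assumes the input is tab-delimited, that attribute values contain no `;`, and that each key occurs once per record (a repeated key such as `tag` would keep only its last value).

```python
import csv
import io


def parse_attributes(attribute_field):
    """Turn 'gene_id "X"; level 3;' into {'gene_id': 'X', 'level': '3'}."""
    attributes = {}
    for entry in attribute_field.strip().split(";"):
        entry = entry.strip()
        if not entry:
            continue  # skip the empty piece after the final ';'
        key, _, value = entry.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def gtf_to_table(lines, keys):
    """Yield one flat row per record: 8 fixed fields + the chosen attributes."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        attributes = parse_attributes(fields[8])
        yield fields[:8] + [attributes.get(key, "") for key in keys]


# One record from the question, written with real tab separators.
sample = (
    'chr1\tENSEMBL\tgene\t17369\t17436\t.\t-\t.\t'
    'gene_id "ENSG00000278267.1"; gene_type "miRNA"; gene_status "KNOWN"; '
    'gene_name "MIR6859-1"; level 3;\n'
)
keys = ["gene_id", "gene_type", "gene_status", "gene_name", "level"]

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerows(gtf_to_table(sample.splitlines(), keys))
table_text = out.getvalue()
print(table_text)
```

To process a real file, the same generator could be fed an open file handle instead of `sample.splitlines()`, and the writer pointed at an output file.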
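Edit: added the try statement.

You can use the pyranges library to parse GTF/GFF files and get each entry of the attribute column as its own column.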
Installation:
# pip install pyranges
# or
# conda install -c bioconda pyranges
Example file:
# !head ensembl.gtf
# #!genome-build GRCh38.p10
# #!genome-version GRCh38
# #!genome-date 2013-12
# #!genome-build-accession NCBI:GCA_000001405.25
# #!genebuild-last-updated 2017-06
# 1 havana gene 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene";
# 1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; tag "basic"; transcript_support_level "1";
# 1 havana exon 11869 12227 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "1"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002234944"; exon_version "1"; tag "basic"; transcript_support_level "1";
# 1 havana exon 12613 12721 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "2"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00003582793"; exon_version "1"; tag "basic"; transcript_support_level "1";
# 1 havana exon 13221 14409 . + . gene_id "ENSG00000223972"; gene_version "5"; transcript_id "ENST00000456328"; transcript_version "2"; exon_number "3"; gene_name "DDX11L1"; gene_source "havana"; gene_biotype "transcribed_unprocessed_pseudogene"; transcript_name "DDX11L1-202"; transcript_source "havana"; transcript_biotype "processed_transcript"; exon_id "ENSE00002312635"; exon_version "1"; tag "basic"; transcript_support_level "1";
Using pyranges:
import pyranges as pr
# as PyRanges-object
gr = pr.read_gtf("ensembl.gtf")
# +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+------------------------------------+-----------------+----------------------+-------+
# | Chromosome | Source | Feature | Start | End | Score | Strand | Frame | gene_id | gene_version | gene_name | gene_source | gene_biotype | transcript_id | transcript_version | +13 |
# | (category) | (object) | (category) | (int32) | (int32) | (object) | (category) | (object) | (object) | (object) | (object) | (object) | (object) | (object) | (object) | ... |
# |--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+------------------------------------+-----------------+----------------------+-------|
# | 1 | havana | gene | 11869 | 14409 | . | + | . | ENSG00000223972 | 5 | DDX11L1 | havana | transcribed_unprocessed_pseudogene | nan | nan | ... |
# | 1 | havana | transcript | 11869 | 14409 | . | + | . | ENSG00000223972 | 5 | DDX11L1 | havana | transcribed_unprocessed_pseudogene | ENST00000456328 | 2 | ... |
# | 1 | havana | exon | 11869 | 12227 | . | + | . | ENSG00000223972 | 5 | DDX11L1 | havana | transcribed_unprocessed_pseudogene | ENST00000456328 | 2 | ... |
# | 1 | havana | exon | 12613 | 12721 | . | + | . | ENSG00000223972 | 5 | DDX11L1 | havana | transcribed_unprocessed_pseudogene | ENST00000456328 | 2 | ... |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | 1 | ensembl | transcript | 120725 | 133723 | . | - | . | ENSG00000238009 | 6 | AL627309.1 | ensembl_havana | lincRNA | ENST00000610542 | 1 | ... |
# | 1 | ensembl | exon | 133374 | 133723 | . | - | . | ENSG00000238009 | 6 | AL627309.1 | ensembl_havana | lincRNA | ENST00000610542 | 1 | ... |
# | 1 | ensembl | exon | 129055 | 129223 | . | - | . | ENSG00000238009 | 6 | AL627309.1 | ensembl_havana | lincRNA | ENST00000610542 | 1 | ... |
# | 1 | ensembl | exon | 120874 | 120932 | . | - | . | ENSG00000238009 | 6 | AL627309.1 | ensembl_havana | lincRNA | ENST00000610542 | 1 | ... |
# +--------------+------------+--------------+-----------+-----------+------------+--------------+------------+-----------------+----------------+-------------+----------------+------------------------------------+-----------------+----------------------+-------+
# Stranded PyRanges object has 95 rows and 28 columns from 1 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
# 13 hidden columns: transcript_name, transcript_source, transcript_biotype, tag, transcript_support_level, exon_number, exon_id, exon_version, (assigned, previous, ccds_id, protein_id, protein_version
# as DataFrame
df = gr.df
# Chromosome Source Feature Start End Score Strand Frame gene_id gene_version gene_name ... transcript_biotype tag transcript_support_level exon_number exon_id exon_version (assigned previous ccds_id protein_id protein_version
# 0 1 havana gene 11869 14409 . + . ENSG00000223972 5 DDX11L1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
# 1 1 havana transcript 11869 14409 . + . ENSG00000223972 5 DDX11L1 ... processed_transcript basic 1 NaN NaN NaN NaN NaN NaN NaN NaN
# 2 1 havana exon 11869 12227 . + . ENSG00000223972 5 DDX11L1 ... processed_transcript basic 1 1 ENSE00002234944 1 NaN NaN NaN NaN NaN
# 3 1 havana exon 12613 12721 . + . ENSG00000223972 5 DDX11L1 ... processed_transcript basic 1 2 ENSE00003582793 1 NaN NaN NaN NaN NaN
# 4 1 havana exon 13221 14409 . + . ENSG00000223972 5 DDX11L1 ... processed_transcript basic 1 3 ENSE00002312635 1 NaN NaN NaN NaN NaN
# .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
# 90 1 havana exon 110953 111357 . - . ENSG00000238009 6 AL627309.1 ... lincRNA NaN 5 3 ENSE00001879696 1 NaN NaN NaN NaN NaN
# 91 1 ensembl transcript 120725 133723 . - . ENSG00000238009 6 AL627309.1 ... lincRNA basic 5 NaN NaN NaN NaN NaN NaN NaN NaN
# 92 1 ensembl exon 133374 133723 . - . ENSG00000238009 6 AL627309.1 ... lincRNA basic 5 1 ENSE00003748456 1 NaN NaN NaN NaN NaN
# 93 1 ensembl exon 129055 129223 . - . ENSG00000238009 6 AL627309.1 ... lincRNA basic 5 2 ENSE00003734824 1 NaN NaN NaN NaN NaN
# 94 1 ensembl exon 120874 120932 . - . ENSG00000238009 6 AL627309.1 ... lincRNA basic 5 3 ENSE00003740919 1 NaN NaN NaN NaN NaN
#
# [95 rows x 28 columns]
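The NaN entries in the output above appear because gene, transcript and exon records carry different attribute keys, yet all share one table. The sketch below illustrates that idea with the standard library only. It is not how pyranges is implemented, just an assumption-labelled illustration: the three records are trimmed-down versions of the example file, and the column order (first appearance) is a choice made here.

```python
import csv
import io

# Trimmed-down records from the example file; each has a different key set.
records = [
    {"Feature": "gene", "gene_id": "ENSG00000223972"},
    {"Feature": "transcript", "gene_id": "ENSG00000223972",
     "transcript_id": "ENST00000456328"},
    {"Feature": "exon", "gene_id": "ENSG00000223972",
     "transcript_id": "ENST00000456328", "exon_number": "1"},
]

# Build the union of all keys, in order of first appearance.
columns = []
for record in records:
    for key in record:
        if key not in columns:
            columns.append(key)

# restval fills in any column a record does not have.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=columns, delimiter="\t",
                        restval="NaN", lineterminator="\n")
writer.writeheader()
writer.writerows(records)
wide_table = out.getvalue()
print(wide_table)
```

The gene row ends up with NaN under transcript_id and exon_number, which mirrors what the DataFrame above shows for gene-level records.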
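Please edit your question to include the required output for your sample input. Good luck.
Added the changes, thanks!
It is very hard to see the difference between your input and output. Can you switch to using | characters between columns? And are you then loading this into Excel or similar? Good luck.
In the original, everything after "gene_id" is grouped into a single field; I want each variable to go into its own column, by identifier.
What I see is that the ; characters are removed. Is that enough? sed 's/;//g' file > outFile will do that. Otherwise, we need to know whether there is a character used to separate the fields, i.e. tab-separated? Good luck.
Thanks. I tried this but got the error "DatabaseError: file is encrypted or is not a database".
@espop23 You may need to uncomment the second line of my code and rename sample.gtf to sRNA.gene.gtf.
Thanks, but then it says "OperationalError: table features already exists".
@espop23 Okay. Now put the comment symbol back in front of the second line of code and try again.
I tried these changes, but the result was: ['chr1\tHAVANA\tgene\t29554\t31109\t.\t+\t.\tgene_id "ENSG0000243485.3"', 'gene_type "lincRNA"', 'gene_status "KNOWN"', 'gene_name "RP11-34P13.3"', 'level 2', 'tag "ncRNA_host"', 'havana_gene "Otthumg0000000959.2"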
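The two database errors quoted in the comments are SQLite errors, since gffutils stores its database in an SQLite file. The sketch below reproduces both messages with the standard sqlite3 module alone. Whether these were the exact causes in the asker's session is an assumption: the first case mimics creating a database in a file that already contains a features table, the second mimics opening a plain-text file (such as a GTF) as if it were a database. The file names are placeholders.

```python
import os
import sqlite3
import tempfile

workdir = tempfile.mkdtemp()

# 1) Creating the same table twice in one database file.
db_path = os.path.join(workdir, "sample.db")
connection = sqlite3.connect(db_path)
connection.execute("CREATE TABLE features (id TEXT)")
try:
    connection.execute("CREATE TABLE features (id TEXT)")
    second_create_error = None
except sqlite3.OperationalError as error:
    second_create_error = str(error)
connection.close()
print(second_create_error)

# 2) Opening a plain-text file as if it were an SQLite database.
text_path = os.path.join(workdir, "sample.gtf")
with open(text_path, "w") as handle:
    handle.write('chr1\tENSEMBL\tgene\t17369\t17436\t.\t-\t.\tgene_id "X";\n' * 20)
connection = sqlite3.connect(text_path)
try:
    connection.execute("SELECT * FROM features")
    not_a_db_error = None
except sqlite3.DatabaseError as error:
    not_a_db_error = str(error)
connection.close()
print(not_a_db_error)
```

The exact wording of the second message depends on the SQLite version ("file is not a database" in newer builds, "file is encrypted or is not a database" in older ones). If the first case is the problem, deleting the stale .db file before calling create_db(), or passing its force=True argument if your gffutils version supports it, are the usual ways out.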