Parsing a FASTA sequence file to retrieve titles and sequences in Python


I have to write a generic parser in Python to parse FASTA files.

The format is as follows:

>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_1, whole genome shotgun sequence  
    TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA

>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_2, whole genome shotgun sequence
CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA
AGTTTATATATAAATTTCCTTTTTATTGGA

>gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic scaffold PHYSOscaffold_3, whole genome shotgun sequence
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA
TAATATGCCTATGCCGCATAATTTTTATATCTTTCTCCTAACAAAACATTCGCTTGTAAA
I have to retrieve each title and sequence separately and insert the values into a MySQL database I have created.

eg: title1 = PHYSOscaffold_1
    sequence1 = TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA
    title2 = PHYSOscaffold_2
    sequence2 = CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA AGTTTATATATAAATTTCCTTTTTATTGGA
and so on. I will insert these values into a MySQL table.

The output of my parser should look like this:

name1 \t sequence1 \t length_of_sequence \t a_count \t t_count \t g_count \t c_count

name2 \t sequence2 \t length_of_sequence \t a_count \t t_count \t g_count \t c_count
So far I have written this very basic script:

infile = open("simple.fasta")
line = infile.readline()
if not line.startswith(">"):
  raise TypeError("Not a FASTA file: %r" % line)
title = line
sequence_lines = []
while 1:
  line = infile.readline().rstrip()
  if line == "":
    break
  sequence_lines.append(line)
I only get my first sequence and title.

I am a beginner and need expert help.

The reason you only get the first title and sequence is that there are blank lines between the records. So when you do:

if line == "":
    break
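it breaks out after the first sequence. Because each line is passed through rstrip(), a blank line and the end of the file both come back as "", so this test cannot tell them apart.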
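As a small illustration of why the stripped line is ambiguous, the sketch below reads from an in-memory io.StringIO object (a stand-in for the real file, used here only as an assumption to keep the example self-contained). Without rstrip(), readline() returns "\n" for a blank line and "" only at the end of the file, so the loop does not stop early.

```python
import io

# In-memory stand-in for a FASTA file with a blank line between records.
handle = io.StringIO(">seq1\nACGT\n\n>seq2\nTTGA\n")

raw_lines = []
while True:
    line = handle.readline()
    if line == "":
        # An empty string is only returned at the end of the file.
        break
    raw_lines.append(line)

# The blank line arrives as "\n", not "", so both records were read.
print(raw_lines)
```

Stripping the line only after the end-of-file test keeps the two cases distinct.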

Here is an inelegant solution to the problem:

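infile = open("simple.fasta")
# State variable so we can handle the start of the file properly.
# There are probably much better ways to do this.
start = True
title = None
sequence_lines = []
records = []  # completed (title, sequence) pairs

# It's much better to iterate over the lines than to use a while 1 loop.
for line in infile:
    line = line.strip()
    if not line:
        continue  # skip the blank lines between records
    if line.startswith(">"):
        if start:
            start = False
        else:
            # Each time we get here we have complete information for a read.
            # You can then store that read in your database.
            records.append((title, "".join(sequence_lines)))
        sequence_lines = []
        title = line
    else:
        if start:
            raise TypeError("Not a FASTA file: %r" % line)
        sequence_lines.append(line)

# The last read is not followed by a ">" line, so store it after the loop.
if title is not None:
    records.append((title, "".join(sequence_lines)))
infile.close()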
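The same idea can also be written as a generator, which removes the need for the start flag. This is only a sketch under some assumptions: the names read_fasta, scaffold_name and format_row are made up for illustration, and the rule used to pull PHYSOscaffold_1 out of the header (the last word before the first comma) is a guess based on the sample headers in the question, not a general FASTA convention. The row layout follows the tab-separated output the question asks for.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA file handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        else:
            if header is None:
                raise ValueError("Not a FASTA file: %r" % line)
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # the last record in the file


def scaffold_name(header):
    """Assumed rule: take the last word before the first comma."""
    words = header.split(",")[0].split()
    return words[-1] if words else header


def format_row(name, sequence):
    """Build: name, sequence, length, A count, T count, G count, C count."""
    seq = sequence.upper()
    fields = [name, sequence, len(seq),
              seq.count("A"), seq.count("T"), seq.count("G"), seq.count("C")]
    return "\t".join(str(field) for field in fields)


# Small in-memory sample shaped like the data in the question.
sample = io.StringIO(
    ">gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic "
    "scaffold PHYSOscaffold_1, whole genome shotgun sequence\n"
    "    TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA\n"
    "\n"
    ">gi|348686675|gb|JH159151.1| Phytophthora sojae unplaced genomic "
    "scaffold PHYSOscaffold_2, whole genome shotgun sequence\n"
    "CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA\n"
    "AGTTTATATATAAATTTCCTTTTTATTGGA\n"
)
rows = [format_row(scaffold_name(h), s) for h, s in read_fasta(sample)]
for row in rows:
    print(row)
```

With a real file, the handle would come from open("simple.fasta") instead of io.StringIO, and each row's fields could be passed to the database insert rather than printed.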

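Is there a specific reason you are not using the routines in BioPython?
My guide has given strict instructions, so I cannot use the BioPython module.
Possible duplicate @赫托尼卡蒙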