Parsing a file into a dictionary in the correct order with Python

Tags: parsing, python-2.7, dictionary, biopython

I have written some code that parses an EMBL file and dumps specific regions of the file into a dictionary. The dictionary's keys correspond to the labels of the regions I want to capture, and each key's value is the region itself. I then wrote another function that writes the contents of the dictionary to a text file. However, the text file contains the information in a different order from the original EMBL file. I can't work out why it does this. Is it because dictionaries are unordered? Is there a way around it?
from Bio import SeqIO

s6633 = SeqIO.read("6633_seq.embl", "embl")

def make_dict_realgenes(x):
    dict = {}
    for i in range(len(x.features)):
        if x.features[i].type == 'CDS':
            if 'hypothetical' not in x.features[i].qualifiers['product'][0]:
                try:
                    if x.features[i].location.strand == -1:
                        x1 = x.features[i].location.end
                        y1 = x1 + 30
                        dict[str(x.features[i].qualifiers['product'][0])] =\
                            str(x[x1:y1].seq.reverse_complement())
                    else:
                        x2 = x.features[i].location.start
                        y2 = x2 - 30
                        dict[x.features[i].qualifiers['product'][0]] =\
                            str(x[y2:x2].seq)
                except KeyError:
                    if x.features[i].location.strand == -1:
                        x1 = x.features[i].location.end
                        y1 = x1 + 30
                        dict[str(x.features[i].qualifiers['translation'][0])] =\
                            str(x[x1:y1].seq.reverse_complement())
                    else:
                        x2 = x.features[i].location.start
                        y2 = x2 - 30
                        dict[x.features[i].qualifiers['translation'][0]] =\
                            str(x[y2:x2].seq)
    return dict

def rbs_file(dict):
    list = []
    c = 0
    for k, v in dict.iteritems():
        list.append(">" + k + " " + str(c) + "\n" + v + "\n")
        c = c + 1

    f = open("out.txt", "w")
    a = 0
    for i in list:
        f.write(i)
        a = a + 1
    f.close()
To preserve insertion order in a dictionary, use OrderedDict from the collections module. Try changing the top of your code to:
from collections import OrderedDict
from Bio import SeqIO
s6633 = SeqIO.read("6633_seq.embl", "embl")
def make_dict_realgenes(x):
    dict = OrderedDict()
    ...
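As a minimal sketch of what OrderedDict guarantees, the example below fills one from a short list and reads the keys back. The gene labels and sequences are made up for illustration and do not come from the EMBL file in the question.

```python
from collections import OrderedDict

# Hypothetical (label, upstream sequence) pairs, listed in file order.
regions = [("dnaA", "ATGACC"), ("recF", "TTGACA"), ("gyrB", "GGAGGT")]

ordered = OrderedDict()
for label, upstream in regions:
    ordered[label] = upstream

# Iterating over an OrderedDict yields keys in insertion order.
labels_in_order = list(ordered.keys())
print(labels_in_order)

# Re-assigning an existing key updates the value but keeps its position.
ordered["dnaA"] = "CCCGGG"
labels_after_update = list(ordered.keys())
print(labels_after_update)
```

On Python 3.7 and later a plain dict also keeps insertion order, but on Python 2.7 (the version tagged in the question) OrderedDict is needed for that guarantee.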
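Also, I recommend not shadowing the built-in "dict" if you can easily rename the variable. I have refactored your code a little, and I suggest writing the output as you parse the file, instead of relying on an OrderedDict to carry the order forward: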
from Bio import SeqIO

output = open("out.txt", "w")

for seq in SeqIO.parse("CP001187.embl", "embl"):
    for feature in seq.features:
        if feature.type == "CDS":
            qualifier = (feature.qualifiers.get("product") or
                         feature.qualifiers.get("translation"))[0]
            if "hypothetical" not in qualifier:
                if feature.location.strand == -1:
                    x1 = feature.location.end
                    x2 = x1 + 30
                    sequence = seq[x1:x2].seq.reverse_complement()
                else:
                    x1 = feature.location.start
                    x2 = x1 - 30
                    sequence = seq[x2:x1].seq

                output.write(">" + qualifier + "\n")
                output.write(str(sequence) + "\n")

                # You can always insert here to the OrderedDict anyway, e.g.
                # d[qualifier] = str(sequence)

output.close()
In Python, for i in range(len(anything)) is rarely a good choice.
There is also a cleaner way to output your sequences with Biopython. Append SeqRecord objects to a list, instead of using a dict or an OrderedDict:
from Bio.SeqRecord import SeqRecord
my_seqs = []
# Each time you generate a sequence, instead of writing to a file
# or inserting in dict, do this:
my_seqs.append(SeqRecord(sequence, id=qualifier, description=""))
# Once my_seqs is filled, the records can be written in a single line:
SeqIO.write(my_seqs, "output.fas", "fasta")
Yes, dictionaries are unordered. If order matters, use a list or an OrderedDict.
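The list option can be sketched as follows: a list of (label, sequence) tuples keeps entries in the order they were appended. The labels and sequences here are invented, and to_fasta_lines is a hypothetical helper, not part of Biopython. A list also keeps entries that share a label, whereas any dictionary keyed by label would keep only the last one.

```python
# Hypothetical (label, sequence) pairs, appended in file order.
records = []
records.append(("ribosomal protein L1", "AGGAGG"))
records.append(("transposase", "TTGACA"))
records.append(("transposase", "GGAGGT"))  # same label as the previous entry


def to_fasta_lines(pairs):
    """Format pairs as FASTA-style entries, numbered in list order."""
    return [">%s %d\n%s\n" % (label, n, seq)
            for n, (label, seq) in enumerate(pairs)]


fasta_lines = to_fasta_lines(records)
print("".join(fasta_lines))

# For comparison: a dict keyed by label collapses the repeated label.
as_dict = dict(records)
print(len(records), len(as_dict))
```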
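Wow, thanks! But what is wrong with for i in range(len(anything))?
It is not Pythonic. If you want to loop over the elements, use for element in list_of_elements. If you need the index, use for i, element in enumerate(list_of_elements). There are other reasons too: for example, if "list_of_elements" is a generator, it does not have to have a length.
Thanks, that is what I wanted.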
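The loop styles discussed in these comments can be compared side by side. This is a small sketch with made-up feature types; feature_stream is a hypothetical generator used only to show the missing length.

```python
features = ["CDS", "tRNA", "CDS"]  # hypothetical feature types

# Index-based loop: works, but the index is only used to fetch the element.
by_index = []
for i in range(len(features)):
    by_index.append(features[i])

# Direct iteration: same result without the index bookkeeping.
direct = []
for feature in features:
    direct.append(feature)

# enumerate: use it when the position is actually needed.
cds_positions = [i for i, feature in enumerate(features) if feature == "CDS"]


def feature_stream():
    """Hypothetical generator yielding the same items one at a time."""
    for feature in features:
        yield feature


# A generator has no len(), so range(len(...)) cannot be used on it.
try:
    len(feature_stream())
    has_length = True
except TypeError:
    has_length = False

# Direct iteration and enumerate still work on the generator.
streamed = list(enumerate(feature_stream()))
print(cds_positions, has_length, streamed)
```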