Python dict and for loop with a FASTA file


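I was given a file in FASTA format (like the ones from this website), which lists the various protein-coding sequences within a particular bacterium. I have been asked to give a full count, and the relative percentage, of each single-letter amino acid code contained in the file, and to return results like this:

L: 139002 (10.7%)
A: 123885 (9.6%)
G: 95475 (7.4%)
V: 91683 (7.1%)
I: 77836 (6.0%)

This is what I have so far: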

#!/usr/bin/python
ecoli = open("/home/file_pathway").read()
counts = dict()
for line in ecoli:
    words = line.split()
    for word in words:
        if word in ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]:
            if word not in counts:
                counts[word] = 1
            else:
                counts[word] += 1

for key in counts:
    print key, counts[key]

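I believe this retrieves every instance of those capital letters, not just the ones inside the protein amino acid strings. How can I restrict it to the coding sequences only? I am also having trouble working out how to compute the total for each code.

The only lines that do not contain what you want start with ">", so just ignore those: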

from collections import defaultdict

with open("input.fasta") as ecoli:  # will close your file automatically
    counts = defaultdict(int)
    for line in ecoli:  # iterate over the file object, no need to read all contents into memory
        if line.startswith(">"):  # skip lines that start with >
            continue
        for char in line:  # just iterate over the characters in the line
            if char in {"A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"}:
                counts[char] += 1
    total = float(sum(counts.values()))
    for key, val in counts.items():
        print("{}: {} ({:.1%})".format(key, val, val / total))

You can also use a collections.Counter dict, since the remaining lines contain only what you are interested in:

from collections import Counter

with open("input.fasta") as ecoli:  # will close your file automatically
    counts = Counter()
    for line in ecoli:  # iterate over the file object, no need to read all contents into memory
        if line.startswith(">"):  # skip lines that start with >
            continue
        counts.update(line.rstrip())  # count every character of the sequence line
    total = float(sum(counts.values()))
    for key, val in counts.items():
        print("{}: {} ({:.1%})".format(key, val, val / total))
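The sample output in the question runs from the most frequent residue to the least frequent, and iterating over `counts.items()` does not sort by count. A minimal sketch of one way to get that ordering, assuming the same Counter approach and using `Counter.most_common()`; the helper name `count_residues` and the in-memory sample lines are made up for illustration:

```python
from collections import Counter


def count_residues(lines):
    """Count single-letter residues in FASTA lines, skipping '>' header lines."""
    counts = Counter()
    for line in lines:
        if line.startswith(">"):
            continue
        counts.update(line.strip())
    return counts


# Hypothetical two-record sample; an open file object can be passed the same way.
sample = [">seq1 made-up header\n", "LLAG\n", ">seq2 made-up header\n", "LAV\n"]
counts = count_residues(sample)
total = float(sum(counts.values()))

# most_common() yields (residue, count) pairs, largest count first.
report = ["{}: {} ({:.1%})".format(aa, n, n / total)
          for aa, n in counts.most_common()]
print("\n".join(report))
```

Because the helper takes any iterable of lines, it can be tried on a short list like this before pointing it at the real file.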

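You are right: the way you are approaching this, you will count instances of those characters wherever they occur, even in the description lines.

But your code will not even run. Have you tried it? You have line.split(), but line is undefined (along with many other errors). Also, when you do "for string in ecoli:" you are already going character by character.

A simple approach is to read the file, split it on newlines, skip the lines that start with ">", count each character you care about, and keep a running total of all the characters analyzed.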

#!/usr/bin/python
ecoli = open("/home/file_pathway.faa").read()
counts = dict()
nucleicAcids = ["A", "C", "D", "E", "F", "G", "H", "I", "K", "L", "M", "N", "P", "Q", "R", "S", "T", "V", "W", "Y"]
for acid in nucleicAcids:
    counts[acid] = 0
total = 0

for line in ecoli.split('\n'):
    if ">" not in line:
        total += len(line)
        for acid in counts.keys():
            counts[acid] += line.count(acid)
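The snippet above stops once the counting is done. A sketch of the remaining reporting step, under the assumption that `counts` and `total` have been filled in as above; the small values below are placeholders for illustration, not real data:

```python
# Placeholder values standing in for the counts and total computed above.
counts = {"L": 5, "A": 3, "G": 2, "W": 0}
total = 10

lines_out = []
# Sort by count, largest first, to mirror the ordering shown in the question.
for acid, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
    share = float(n) / total if total else 0.0  # guard against an empty file
    lines_out.append("{}: {} ({:.1%})".format(acid, n, share))
print("\n".join(lines_out))
```

Since every amino acid is initialised to 0 up front, residues that never occur still appear in the report with a count of 0.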
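Using Counter makes this easier and avoids using dictionaries (I like dicts, but in this case a Counter really makes sense).

Since Counter accepts iterables, it should be possible to do this with a generator: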

from collections import Counter
with open(filename) as f:
    counter = Counter(c for line in f if not line.startswith('>')
                      for c in line.strip())
# and now as above
total = float(sum(counter.values()))
for k, v in counter.items():
    print("{}: {} ({:.1%})".format(k, v, v/total))

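Did you mean words=line.split() for the line words=string.split()? Can you give an example of the file you are reading?

To answer "how can I restrict it to only the coding sequences" you first need to define exactly what that means (in English, or pseudocode, or a flowchart, or whatever you are comfortable with). You cannot write code until you know what you are writing, and nobody can help you write code until they know what you want to write.

@k-nut Yes, there are many (thousands) of these: >gi|31563518|ref|NP_852610.1| microtubule-associated protein 1A/1B light chain 3A isoform b [Homo sapiens] KMRFSSPCGKAAVDPADRCKEVQQQQRDQHPSKIPVIIERYKGEKQLPLDKFLVDHVNMSELVKIIRRRLQLNPTQAFFFLLVNQHSMVSVSTPIADIYEQEDGFLYMVYASQETFGF

Please take a look at this. It can parse FASTA files natively, so you can operate on just the sequences.

Simply iterating over the file, with open('/home/file_pathway.faa') as ecoli: for line in ecoli:, is a nicer way to express it.

You are right, but I figured this might be part of a class, and they may or may not know about "with". Still, my str.count() suggestion is probably outdated.

Yes, I just realized that, and I have corrected my code so that it works. I had copied and pasted the wrong version of my answer. Padraic's answer is the way to go.

I keep getting "SyntaxError: invalid syntax", pointing at the second " on the print line. Not sure why I cannot get it to work; I am using Python 3.4.1.

You need () when using print in Python 3, because it is a function. I edited the code to do that.

Thank you so much! I thought it needed () but when I added them, I replaced the "" with them. Thanks again!!

Yes. But you avoid constructing it yourself, so there is no handling of missing entries, incrementing values, and so on. (But you are right: technically it is a dictionary.)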
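The last two points in the comments (Counter being "technically a dictionary", and print being a function in Python 3) can be checked directly. A small sketch using only the standard library; the string "LLAG" is a made-up sequence for illustration:

```python
from collections import Counter

counter = Counter("LLAG")  # counts each character of the string

# Counter subclasses dict, so ordinary dictionary operations still work on it.
is_dict = isinstance(counter, dict)

# A missing key reads as 0 instead of raising KeyError, and is not stored.
missing = counter["W"]
stored = "W" in counter

# print is called as a function here, which is required in Python 3.
total = float(sum(counter.values()))
line_out = "{}: {} ({:.1%})".format("L", counter["L"], counter["L"] / total)
print(line_out)
```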