Python: parsing sequences into a dictionary (python, dictionary, fasta)


I need the simplest way to convert a fasta.txt file containing multiple nucleotide sequences, such as

>seq1
TAGATTCTGAGTTATCTCTTGCATTAGCAGGTCATCCTGGTCAAACCGCTACTGTTCCGG
CTTTCTGATAATTGATAGCATACGCTGCGAACCCACGGAAGGGGGTCGAGGACAGTGGTG
>seq2
TCCCTCTAGAGGCTCTTTACCGTGATGCTACATCTTACAGGTATTTCTGAGGCTCTTTCA
AACAGGTGCGCGTGAACAACAACCCACGGCAAACGAGTACAGTGTGTACGCCTGAGAGTA
>seq3
GGTTCCGCTCTAAGCCTCTAACTCCCGCACAGGGAAGAGATGTCGATTAACTTGCGCCCA
TAGAGCTCTGCGCGTGCGTCGAAGGCTCTTTTCGCGATATCTGTGTGGTCTCACTTTGGT

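into a dictionary of (name, value) pairs, where name is the > header and value is the corresponding sequence.

Below is my failed attempt using 2 lists (it does not work for long sequences that span more than one line).

I would be grateful for a solution that fixes it, along with an example of how to implement it as a separate function.

Thanks for your help,

Gleb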

A simple correction to your code:

from collections import defaultdict  # missing keys start out as an empty string

sequences = defaultdict(str)  # renamed from `list` to avoid shadowing the built-in
name = ''
with open('input2.txt', 'r') as f:  # the file is closed automatically
    for line in f:
        # A line that starts with '>' is the name of the sequence that follows
        if line.startswith('>'):
            name = line[1:].strip()  # drop the '>' and the trailing newline
            continue  # skip to the next line
        # Only reached for a line of bases, not a name
        sequences[name] += line.strip()
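Since the question also asks for a function form, here is a minimal sketch that wraps the same loop in a function. The name `fasta_to_dict` and the choice to accept any iterable of lines (rather than a filename) are assumptions made for illustration; they are not part of the original answer. The shortened sample records are likewise made up for the demonstration.

```python
from collections import defaultdict
import io


def fasta_to_dict(lines):
    """Build {header: sequence} from an iterable of FASTA-formatted lines."""
    sequences = defaultdict(str)
    name = ''
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith('>'):
            name = line[1:]  # header text without the leading '>'
            continue
        sequences[name] += line  # append this line to the current record
    return dict(sequences)


# Hypothetical two-record sample; an open file object works the same way.
sample = io.StringIO(
    ">seq1\n"
    "TAGATTCTGAG\n"
    "CTTTCTGATAA\n"
    ">seq2\n"
    "TCCCTCTAGAG\n"
)
parsed = fasta_to_dict(sample)
print(parsed["seq1"])  # TAGATTCTGAGCTTTCTGATAA
```

Accepting an iterable keeps the function easy to test without touching the disk; to read a real file, pass the object returned by `open(...)` inside a `with` block.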

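Update:

Since I received a notification about a vote on this old answer, I decided to show what I now consider the proper solution, using Python 3.7. Converting it to Python 2.7 only requires removing the typing import line and the function annotations: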

from collections import OrderedDict
from typing import Dict

NAME_SYMBOL = '>'


def parse_sequences(filename: str,
                    ordered: bool = False) -> Dict[str, str]:
    """
    Parses a text file of genome sequences into a dictionary.
    Arguments:
      filename: str - The name of the file containing the genome info.
      ordered: bool - Set this to True if you want the result to be ordered.
    """
    result = OrderedDict() if ordered else {}

    last_name = None
    with open(filename) as sequences:
        for line in sequences:
            line = line.strip()  # also safe when the last line has no newline
            if not line:
                continue  # skip blank lines
            if line.startswith(NAME_SYMBOL):
                last_name = line[len(NAME_SYMBOL):]
                result[last_name] = []
            elif last_name is not None:
                # Lines before the first header are ignored
                result[last_name].append(line)

    # Join each record's collected lines into a single string
    for name in result:
        result[name] = ''.join(result[name])

    return result

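Now, I realize the OP asked for the "simplest solution". However, since they are working with genome data, it seems fair to assume that each sequence could be very large. In that case it makes sense to optimize a little by collecting the sequence lines into a list and then calling the str.join method on those lists at the end to produce the final result.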

It is better to use the Biopython library:

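from Bio import SeqIO

# Keys are record IDs; values are SeqRecord objects (str(record.seq) gives the raw sequence)
with open("input.fasta") as input_file:
    my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))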

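Thank you very much for the suggestions. Could you also give me an example that uses defaultdict and list instances (keeping the names and sequences in separate lists, in addition to the dictionary)? Best, Gleb

I know this is old, but I just got a notification that this answer was upvoted, so I will reply to your comment, @user3470313. A dictionary can give you a list of its keys through the .keys() method. If you need those keys to stay in order, you can use the OrderedDict class from the collections module, but then you have to add a few lines of bookkeeping in the loop. I will update my answer to demonstrate this.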
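As a rough illustration of the reply above, the sketch below pulls the names and sequences out of an already-built dictionary into two separate lists. The variable names and the shortened sequences are placeholders chosen for the example, not data from the question.

```python
from collections import OrderedDict

# Hypothetical parsed result; OrderedDict keeps the insertion order explicit.
records = OrderedDict()
records["seq1"] = "TAGATT"
records["seq2"] = "TCCCTC"
records["seq3"] = "GGTTCC"

# .keys() and .values() iterate in the same order, so the lists stay aligned.
names = list(records.keys())
sequences = list(records.values())

for name, sequence in zip(names, sequences):
    print(name, len(sequence))
```

From Python 3.7 onward a plain dict also preserves insertion order, so OrderedDict mainly matters on older interpreters or when the ordering should be explicit in the code.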
