Aligning DNA sequences in Python [python, alignment, biopython]

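I have thousands of DNA sequences ranging from 100 to 5000 bp, and I need to align specified pairs and compute their identity scores. Biopython's pairwise2 does a good job, but only for short sequences. When the sequences are longer than about 2 kb it shows a severe memory leak that ends in a "MemoryError", even when the "score_only" and "one_alignment_only" options are used.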

from Bio import pairwise2

whole_coding_scores = {}
for genes in whole_coding:  # whole_coding is a <25 MB dict providing DNA sequences
    seq_a, seq_b = whole_coding[genes][3], whole_coding[genes][4]
    alignment = pairwise2.align.globalxx(seq_a, seq_b, score_only=True, one_alignment_only=True)
    whole_coding_scores[genes] = alignment / min(len(seq_a), len(seq_b))

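I know there are other alignment tools, but most of them write the scores to an output file, which then has to be read and parsed again to retrieve and use the alignment scores.

Is there any tool that can align sequences and return the alignment score inside the Python environment, as pairwise2 does, but without the memory leak?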

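For global alignment, you could try NWalign. I haven't used it myself, but it looks like you can get the alignment score back in your script.

Otherwise, the EMBOSS tools might help.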
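If the EMBOSS route is taken, the score still has to be read back out of the report that needle produces. The following is only a sketch: it assumes the report contains header lines of the form `# Identity:` and `# Score:` (as in needle's default pair format), and `SAMPLE_REPORT` is a made-up excerpt for illustration, not real output. The function name `parse_needle_report` is hypothetical.

```python
import re

# Hypothetical excerpt, laid out like the header of a needle report.
SAMPLE_REPORT = """\
# Length: 12
# Identity:       9/12 (75.0%)
# Similarity:     9/12 (75.0%)
# Gaps:           2/12 (16.7%)
# Score: 31.0
"""

IDENTITY_RE = re.compile(r"^# Identity:\s+(\d+)/(\d+)", re.MULTILINE)
SCORE_RE = re.compile(r"^# Score:\s+(-?\d+(?:\.\d+)?)", re.MULTILINE)


def parse_needle_report(text):
    """Return (identical, aligned_length, score) for the first alignment."""
    identity = IDENTITY_RE.search(text)
    score = SCORE_RE.search(text)
    if identity is None or score is None:
        raise ValueError("no Identity/Score header found")
    return int(identity.group(1)), int(identity.group(2)), float(score.group(1))
```

The same function works whether the text was read from a file or captured from the process's standard output, so the parsing step itself does not have to go through an intermediate file.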

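First, I used needle through Biopython for this. There is a good how-to for it (ignore the legacy design :-)).
Second: maybe you can avoid holding the whole data set in memory by using a generator? I don't know where your "whole_coding" object comes from. But if it is a file, make sure you don't read the whole file and then iterate over the in-memory object. For example: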

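whole_coding = open('big_file', 'rt').readlines()  # Will consume memory

But iterating over the file object directly does not read the whole thing into memory first.

If you need preprocessing, you can write a generator function:

def gene_yielder(filename):
    for line in open(filename, 'rt'):
        line = line.strip()  # Here you preprocess your data
        yield line  # This will return one line at a time

Then loop over the generator instead of over a list held in memory.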
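To make the line-at-a-time idea concrete, here is a small self-contained sketch. It assumes a tab-separated input with one gene per line (name, first sequence, second sequence); that layout, the names `pair_yielder`, `score_all` and `demo`, and the toy scorer are all invented for illustration and are not taken from the question.

```python
import os
import tempfile


def pair_yielder(filename):
    """Yield (name, seq_a, seq_b) tuples, reading one line at a time."""
    with open(filename, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            name, seq_a, seq_b = line.split("\t")
            yield name, seq_a, seq_b


def score_all(filename, scorer):
    """Consume the generator; only the small per-gene results are kept."""
    return {name: scorer(a, b) for name, a, b in pair_yielder(filename)}


def demo():
    # Write a tiny made-up input file so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("wt", suffix=".tsv", delete=False) as tmp:
        tmp.write("geneA\tACGT\tACGA\n\ngeneB\tAAAA\tAAAA\n")

    def toy_scorer(a, b):
        # Fraction of identical positions in two equal-length strings.
        return sum(x == y for x, y in zip(a, b)) / len(a)

    try:
        return score_all(tmp.name, toy_scorer)
    finally:
        os.remove(tmp.name)
```

In real use, `toy_scorer` would be swapped for whatever alignment call is chosen. Only one pair of sequences is alive at a time, plus the dictionary of scores.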
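Basically, you want your program to act as a pipe: stuff flows through it and gets processed. Don't use it like a cooking pot when preparing broth, where you add all the ingredients and then apply heat. I hope this comparison is not too far-fetched :-)

Biopython can (now). The pairwise2 module in Biopython version 1.68 is faster and can take longer sequences. Here is a comparison of the old and the new pairwise2 (on 32-bit Python 2.7.11 with a 2 GB memory limit, 64-bit Win7, Intel Core i5, 2.8 GHz):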

Old pairwise2

Max length/time: ~1900 chars / 10 s

New pairwise2

Max length/time: ~7450 chars / 12 s

Time for 1900 chars: 1 s

With score_only set to True, the new pairwise2 can do two sequences of about 8400 chars in 6 seconds.
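As background on why a score-only run can get by with far less memory than a full alignment: the global (Needleman-Wunsch) score can be computed while keeping only two rows of the dynamic-programming table. The sketch below is a plain-Python illustration of that idea, not a description of how pairwise2 is implemented. Its defaults (match 1, no mismatch or gap penalty) are chosen to mirror globalxx scoring; the function names are made up, and gap penalties, if used, are passed as negative numbers.

```python
def global_score(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Needleman-Wunsch score with a linear gap cost, keeping two DP rows."""
    # Keep the shorter sequence along the row to minimise memory.
    if len(seq_b) > len(seq_a):
        seq_a, seq_b = seq_b, seq_a
    # Row 0: an empty prefix of seq_a aligned against prefixes of seq_b.
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, base_a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, base_b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if base_a == base_b else mismatch)
            up = previous[j] + gap
            left = current[j - 1] + gap
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


def identity(seq_a, seq_b):
    """Score divided by the shorter length, as in the question."""
    return global_score(seq_a, seq_b) / min(len(seq_a), len(seq_b))
```

Memory use grows with the shorter sequence rather than with the product of both lengths. Running time is still quadratic, so in pure Python this is slow for 5 kb sequences; it is meant to show the memory argument, not to replace a compiled aligner.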
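Actually, it seems there is no solution that avoids writing and reading the data from Python. I also solved it with the Biopython needle wrapper, which took some extra work but got the job done. Second, whole_coding is a ...

for gene in open('big_file', 'rt'):  # will not read the whole thing into memory first
    process(gene)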

for gene in gene_yielder('big_file'):
    process_gene(gene)

from Bio import pairwise2

length_of_sequence = ...
seq1 = '...'[:length_of_sequence]  # Two very long DNA sequences, e.g. human and green monkey
seq2 = '...'[:length_of_sequence]  # dystrophin mRNAs (NM_000109 and XM_007991383, >13 kBp)
aln = pairwise2.align.globalms(seq1, seq2, 5, -4, -10, -1)