(BioPython) How do I stop MemoryError: Out of memory exception?


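I have a program in which I take a pair of very large multiple-sequence files (>77,000 sequences, each averaging about 1000 bp in length), calculate the alignment score between each pair of corresponding elements, and write that number to an output file (which I will later load into an Excel file).

My code works for small multiple-sequence files, but with my large master files it throws the following traceback after analyzing the 16th pair: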

Traceback (most recent call last):
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 109, in <module>
    cycle(f,k,binLen)
  File "C:\Users\Harry\Documents\cgigas\BioPython Programs\Score Create Program\scoreCreate", line 85, in cycle
    a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 301, in __call__
    return _align(**keywds)
  File "C:\Python26\lib\site-packages\Bio\pairwise2.py", line 322, in _align
    score_only)
MemoryError: Out of memory

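I have tried a lot of things to get around this (as many of you may see from the code), all to no avail. I have tried splitting the large master files into smaller batches to feed into the score calculation method. I have tried deleting (del) the files once I am done with them, and I have tried running on my Ubuntu 11.11 install in an Oracle virtual machine (I typically work in 64-bit Windows 7). Am I being too ambitious? Is this computationally feasible in BioPython? Below is my code. I have no experience in memory debugging, which is clearly the culprit here. Any assistance is greatly appreciated; I am becoming very frustrated with this problem.

Best, Harry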

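See also the similar question on BioStars. There I suggested trying existing tools for this kind of thing, e.g. EMBOSS needleall (you can parse EMBOSS alignment output with Biopython).

The latest version of Biopython (1.68) has an updated pairwise2 module that is faster and consumes less memory.

P.S. I am not committed to using Biopython; I just need to get my list of scores from this pair of FASTA files.

Pairwise sequence alignment requires building a matrix of size N*M, where N and M are the lengths of the two sequences. A 1000x1000 matrix of integers is 4 MB, which should be no problem, but your computer may choke on a pair of sequences much longer than 1 kb. Try printing the sequence lengths for each pair to see whether that is the case.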
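
The matrix-size argument above can be checked with a few lines before running any alignment. The following is a minimal sketch, not part of the original code: the function names, the 500 MiB warning threshold, and the toy sequences are all hypothetical, and the 4 bytes per cell is simply the figure used in the answer (the real per-cell cost inside a Python list is likely higher, so treat the estimate as a lower bound). It is written for Python 3, whereas the question's code is Python 2.

```python
def matrix_bytes(n, m, bytes_per_cell=4):
    """Approximate size in bytes of a dense n x m score matrix."""
    return n * m * bytes_per_cell


def report_pair_sizes(pairs, warn_bytes=500 * 1024 * 1024):
    """Yield (pair_number, len_a, len_b, est_bytes, is_large) for each pair.

    `pairs` is any iterable of (sequence_a, sequence_b); only len() is used,
    so strings or Biopython Seq objects would both work.
    """
    for number, (seq_a, seq_b) in enumerate(pairs, start=1):
        est = matrix_bytes(len(seq_a), len(seq_b))
        yield number, len(seq_a), len(seq_b), est, est > warn_bytes


if __name__ == "__main__":
    # Hypothetical toy data; in the question these would be the subject and
    # query sequences read pairwise from the two FASTA files.
    toy_pairs = [("A" * 1000, "C" * 1000), ("A" * 20000, "C" * 15000)]
    for number, n, m, est, is_large in report_pair_sizes(toy_pairs):
        flag = "  <-- large" if is_large else ""
        print("Pair %d: %d x %d, ~%.1f MB%s" % (number, n, m, est / 1e6, flag))
```

If the pair that fails (the 16th in the question) turns out to be far longer than the ~1000 bp average, the quadratic matrix is the likely cause rather than a leak between iterations.
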
Code from the question:

##Open reference file
##a.)Upload subjectList
##b.)Upload query list (a and b are pairwise data)
## Cycle through each paired FASTA and get alignment score of each(Large file)

from Bio import SeqIO
from Bio import pairwise2
import gc


##BATCH ITERATOR METHOD (not my code)
def batch_iterator(iterator, batch_size) :
    entry = True #Make sure we loop once
    while entry :
        batch = []
        while len(batch) < batch_size :
            try :
                entry = iterator.next()
            except StopIteration :
                entry = None
            if entry is None :
                #End of file
                break
            batch.append(entry)
        if batch :
            yield batch

def split(subject,query):
    ##Query Iterator and Batch Subject Iterator
    query_iterator = SeqIO.parse(query,"fasta")
    record_iter = SeqIO.parse(subject,"fasta")

    ##Writes both large file into many small files
    print "Splitting Subject File..."
    binLen=2
    for j, batch1 in enumerate(batch_iterator(record_iter, binLen)) :
        filename1="groupA_%i.fasta" % (j+1)
        handle1=open(filename1, "w")
        count1 = SeqIO.write(batch1, handle1, "fasta")
        handle1.close()

    print "Done splitting Subject file"
    print "Splitting Query File..."

    for k, batch2 in enumerate(batch_iterator(query_iterator,binLen)):
        filename2="groupB_%i.fasta" % (k+1)
        handle2=open(filename2, "w")
        count2 = SeqIO.write(batch2, handle2, "fasta")
        handle2.close()

    print "Done splitting both FASTA files"
    print " "
    return [k ,binLen]


##This file will hold the alignment scores as tab-delimited text
f = open("C:\\Users\\Harry\\Documents\\cgigas\\alignScore.txt", 'w')

def cycle(f,k,binLen):
    i=1
    m=1
    while  i<=k+1:
        ##Open the first small file
        subjectFile = open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupA_" + str(i)+".fasta", "rU")
        queryFile =open("C:\\Users\\Harry\\Documents\\cgigas\\BioPython Programs\\groupB_" + str(i)+".fasta", "rU")
        i=i+1
        j=0


        ##Make small file iterators
        smallQuery=SeqIO.parse(queryFile,"fasta")
        smallSubject=SeqIO.parse(subjectFile,"fasta")

        ##Cycles through both sets of FASTA files
        while j<binLen:
                j=j+1
                currentQuery=smallQuery.next()
                currentSubject=smallSubject.next()
                ##Verify every pair is correct
                print " "
                print "Pair: " +  str(m)
                print "Subject: "+ currentSubject.id
                print "Query: " + currentQuery.id
                gc.collect()
                a = pairwise2.align.localxx(currentSubject.seq, currentQuery.seq, score_only=True)
                gc.collect()
                currentQuery=None
                currentSubject=None
                score=str(a)
                a=None
                print "Score: " + score
                f.write(score + "\n")
                m=m+1

        smallQuery.close()
        smallSubject.close()
        subjectFile.close()
        queryFile.close()
        gc.collect()
        print "New file"
##MAIN PROGRAM
##Here is our paired list of FASTA files

##subject = open("C:\\Users\\Harry\\Documents\\cgigas\\subjectFASTA.fasta", "rU")
##query =open("C:\\Users\\Harry\\Documents\\cgigas\\queryFASTA.fasta", "rU")
##[k,binLen]=split(subject,query)
k=272
binLen=2
cycle(f,k,binLen)
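
Since only the score is needed, one option is to avoid both the intermediate split files and the full N*M matrix. The sketch below is an illustration under stated assumptions, not how pairwise2 is implemented: `read_fasta`, `local_score`, and `score_pairs` are hypothetical helpers written for this example in Python 3, the FASTA reader is a simplified stand-in for `SeqIO.parse`, and the default scoring (match 1, mismatch 0, no gap penalty) is chosen to mirror the `localxx` scheme. The score is computed while holding only two rows of the table, so memory grows with the shorter sequence rather than with N*M.

```python
from itertools import zip_longest


def read_fasta(handle):
    """Yield (identifier, sequence) tuples from a FASTA handle, one record at a time."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def local_score(a, b, match=1, mismatch=0, gap=0):
    """Score-only local alignment that keeps two rows of the DP table."""
    if len(b) > len(a):
        a, b = b, a  # scoring is symmetric, so put the shorter sequence on the row
    prev = [0] * (len(b) + 1)
    best = 0
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ch_a == ch_b else mismatch)
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best = curr[j]
        prev = curr
    return best


def score_pairs(subject_path, query_path, out_path, scorer=local_score):
    """Walk both FASTA files in lockstep and write one tab-separated line per pair."""
    count = 0
    with open(subject_path) as subj, open(query_path) as qry, open(out_path, "w") as out:
        for subject, query in zip_longest(read_fasta(subj), read_fasta(qry)):
            if subject is None or query is None:
                raise ValueError("the two files hold different numbers of records")
            score = scorer(subject[1], query[1])
            out.write("%s\t%s\t%s\n" % (subject[0], query[0], score))
            count += 1
    return count
```

A call such as `score_pairs("subjectFASTA.fasta", "queryFASTA.fasta", "alignScore.txt")` would process the pairs one at a time. The pure-Python inner loop is slow for 77,000 pairs of ~1000 bp, so in practice a compiled tool such as the EMBOSS program mentioned above is probably the better choice; the point here is only that a score-only computation does not need the whole matrix in memory.
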