Python code in Visual Studio makes no progress after days of processing two 32 GB files [python, bigdata]


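I am trying to extract data from specific lines of a 32 GB file, put the extracted data into a dictionary, and then read through a second 32 GB file, replacing specific lines using the keys and values of that dictionary. Finally, I want to write all of this new information to a brand-new file.

However, the program has now been running for more than 12 hours and is still going. I implemented a progress bar, and after two hours it has not advanced by even one percent. I get no error message, but I see no progress either. Does anyone know why? Perhaps it has trouble reading files this large? Any help would be greatly appreciated. This is the code I am using: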

import gzip
from itertools import islice
from datetime import datetime
import time
from tqdm import tqdm

from tqdm import tqdm
for i in tqdm(range(10000)):

    ## store the time the program started running
    start_time = time.time()
    ########### R1 #############

    ## create dictionary that will store the read ID as keys and barcode as values
    readID_dictionary = {}

    ## open the read 1 fastq file as R1
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
    
        ## for each line in the file, look for the read ID that starts with @ and store the read name, sequence, blank (+), and quality score as variables
        for line in R1:

            ## only perform operations if the line starts with @
            if line[0] == '@':

                ## split the lines by whitespace
                readID = line.split()

                ## store the read id for each read in a variable
                readID = readID[0]

                ## store the sequence for each read in a variable
                sequence = next(R1)

                ## store the barcode (first 20 characters)
                barcode = sequence[:20]

                ## append reads as keys and barcodes as values respectfully in dictionary
                readID_dictionary[readID] = barcode

###########################################################

########### R2 #############

    ## content that will be in new file
    new_file_content = ""

## open R2 .fastq file as R2
    with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:

    ## for each line in the file
        for line in R2:

        ## if the line starts with @ perform the operations
            if line[0] == '@':

            ## split the lines by whitespace
                readID = line.split()

            ## store the read ID 
                readID = readID[0]

            ## if the read ID matches the read ID from R1 (key in dictionary) then have the read ID in R2 equal that ID with _ and barcode
                for key, value in readID_dictionary.items():
                    if readID == key:
                        readID = key + '_' + value

            ## store sequence 
                sequence = next(R2)

            ## store blank (plus sign)
                blank = next(R2)

            ## store quality score
                quality = next(R2)

            ## format the content for the new file
                new_file_content += readID +'\n' + sequence + blank + quality 

###########################################################

########### NEW FILE WITH UPDATED READID+BARCODE #############

## create a new file with the updated read ID
    writing_file = open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w")

## put the content in the new file
    writing_file.write(new_file_content)

## close the file
    writing_file.close()

###########################################################

###########################################################

########### REPORT #############

## show how long program took to run
    print("Process finished --- %s seconds ---" % (time.time() - start_time))

###########################################################

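In general, I always run a program on a much smaller input first (1 MB, 10 MB, 100 MB?) to see whether it works correctly and, if it does, how long it takes per MB. From that I can estimate roughly how long the full file will take, and how much progress to expect at any given point in the run.

You could even run these small tests while the big run is still going, so that you at least see the program actually work and finish (without losing your current progress). Start with a very small file (perhaps the first 1 MB of the large file), and increase the size if that works.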
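A minimal sketch of that workflow, assuming the input is a plain-text FASTQ file with exactly four lines per record. The helper names `write_sample` and `estimate_seconds` are made up for this illustration and are not part of the original code.

```python
from itertools import islice


def write_sample(src_path, dst_path, n_records, lines_per_record=4):
    """Copy the first n_records records (4 lines each) of src_path to dst_path."""
    with open(src_path, "rt") as src, open(dst_path, "wt") as dst:
        # islice stops after the requested number of lines,
        # so the rest of the large file is never read
        dst.writelines(islice(src, n_records * lines_per_record))


def estimate_seconds(sample_seconds, sample_bytes, full_bytes):
    """Extrapolate linearly: assumes run time is proportional to input size."""
    return sample_seconds * full_bytes / sample_bytes
```

The idea is to time the script on samples of increasing size and feed the measurements into `estimate_seconds`. The linear assumption deserves a check: if the time per MB keeps growing as the sample grows, the program scales worse than linearly. The question's code would show exactly that, because it walks the whole dictionary for every read in the second file.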

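Looking at the actual program, though, I would definitely not collect all the data in memory and write it out only at the end. I would write to the output file continuously. That is more efficient and avoids the huge amount of virtual memory the current program uses.

So, something like this (untested, because I can't run it):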

Use the dictionary as a dictionary, not as a list.

Don't keep the new file content in memory: just write it to disk as you process.

import time

## store the time the program started running
start_time = time.time()

## note: the question wraps all of this in `for i in tqdm(range(10000)):`,
## which would repeat the whole job 10000 times; here it runs once

########### R1 #############

## create dictionary that will store the read ID as keys and barcode as values
readID_dictionary = {}

## open the read 1 fastq file as R1
with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R1.fastq', 'rt') as R1:
    for line in R1:
        ## only perform operations if the line starts with @
        if line[0] != '@':
            continue
        readID = line.split()[0]
        ## store the barcode (first 20 characters of next line)
        readID_dictionary[readID] = next(R1)[:20]
        ## skip the + line and the quality line, which may itself start with @
        next(R1)
        next(R1)

###########################################################

########### R2 #############

## open R2 .fastq file as R2 and write the new file while reading
with open('/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_copy.fastq', 'rt') as R2:
    with open("/Users/jakevazquez18/Desktop/Axel_Scripts/UMI_Tools/AD507-noCL.R2_test.fastq", "w") as newfile:
        for line in R2:
            if line[0] != '@':
                continue
            readID = line.split()[0]
            ## dictionary lookup: if the read ID is a key from R1, append _ and its barcode
            if readID in readID_dictionary:
                readID = readID + '_' + readID_dictionary[readID]
            ## store sequence
            sequence = next(R2)
            ## store blank (plus sign)
            blank = next(R2)
            ## store quality score
            quality = next(R2)
            ## write the record straight to the new file
            newfile.write(readID + '\n')
            newfile.write(sequence + blank + quality)

###########################################################

########### REPORT #############

## show how long program took to run
print("Process finished --- %s seconds ---" % (time.time() - start_time))

###########################################################
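To illustrate why "dictionary as a dictionary" matters, here is a small timing sketch. The read IDs and barcodes are made-up data, and `find_by_scan`, `find_by_lookup` and `time_lookups` are names invented for this example. The first function mirrors the `for key, value in readID_dictionary.items()` loop from the question; the second uses a direct key lookup.

```python
import time


def find_by_scan(table, wanted):
    """Compare against every key, like the question's inner loop: O(n) per query."""
    result = None
    for key, value in table.items():
        if key == wanted:
            result = value
    return result


def find_by_lookup(table, wanted):
    """Hash lookup: the cost does not grow with the size of the dictionary."""
    return table.get(wanted)


def time_lookups(func, table, queries):
    """Return the seconds needed to run func once for every query."""
    start = time.perf_counter()
    for query in queries:
        func(table, query)
    return time.perf_counter() - start


# made-up read IDs and 20-character barcodes, for illustration only
table = {f"@read{i}": "ACGT" * 5 for i in range(10_000)}
queries = [f"@read{i}" for i in range(0, 10_000, 100)]  # 100 queries

print(f"scan:   {time_lookups(find_by_scan, table, queries):.4f} s")
print(f"lookup: {time_lookups(find_by_lookup, table, queries):.6f} s")
```

With the scan, the total work is (number of reads in R2) x (number of reads in R1). For files of this size that product is astronomically large, which is consistent with a run that never seems to finish.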


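What is tqdm? You are running this 10000 times?

Thanks for your answer! You are right about splitting up the processing. I also looked into the chunksize parameter of .read(), but the only problem is that I have to read the file four lines at a time, starting at a line that begins with "@", so I don't want to cut any four-line group in half or start in the wrong place. I will try your answer, though. I like the idea of putting the loop inside the block that opens the output file.

@IronMan18 There is no need to use .read() or chunksize; that would only complicate things. If you want to read more data at once, you can change the buffer size, although I am not sure how much that would help. But look at rioV8's answer: they point out that you can use the dictionary as a dictionary, and I think that can speed up the program a lot. I did not look closely at the details when I answered.

Thanks for telling me about the chunk size, that saved me time. I am trying this answer now. I am still curious why the progress stayed at 0% for so long. I assumed that a memory problem would produce an error message, but the program just kept running, for hours (even a day), without any progress. Do you know why that is?

@IronMan18 Very sorry for the late reply! Life got in the way here and I lost track of this. I know it may not matter any more, but I would need to see the actual code that displays the progress, to understand how progress is measured and when it is reported or updated, before I could say why there is no (apparent) progress. The problem could be in the reporting itself. Again, sorry for being late.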
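On the 0% question: in the code as posted, the bar wraps `range(10000)`, so it can only advance after one complete pass over both files, and 1% would require 100 such passes. A bar that tracks bytes consumed would move during a single pass. Below is a minimal sketch of that idea using only the standard library; `fastq_records` and `with_progress` are hypothetical helpers, and the two-record input is made up. The file is read in binary mode so that `len(line)` counts bytes, and records are read four lines at a time so that a quality line starting with "@" is never mistaken for a header.

```python
import io


def fastq_records(handle):
    """Yield (header, sequence, plus, quality) as bytes, four lines at a time."""
    while True:
        header = handle.readline()
        if not header:  # empty bytes object means end of file
            return
        yield header, handle.readline(), handle.readline(), handle.readline()


def with_progress(records, total_bytes, report, step_percent=1):
    """Pass records through and call report(percent) as more bytes are consumed."""
    done = 0
    next_mark = step_percent
    for record in records:
        yield record
        # runs once the caller has finished with the record and asks for the next
        done += sum(len(line) for line in record)
        percent = 100 * done // total_bytes
        if percent >= next_mark:
            report(percent)
            next_mark = percent + step_percent


# made-up two-record input; note the first quality line starts with "@"
data = b"@r1 1\nACGT\n+\n@III\n@r2 1\nTTTT\n+\nIIII\n"
seen = []
records = with_progress(fastq_records(io.BytesIO(data)), len(data),
                        seen.append, step_percent=50)
headers = [header for header, sequence, plus, quality in records]
print(headers)  # [b'@r1 1\n', b'@r2 1\n']
print(seen)     # [50, 100]
```

For a real file, the handle would come from `open(path, "rb")` and `total_bytes` from `os.path.getsize(path)`; `report` could be `print` or the `update`-style call of whichever progress bar is in use.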