How can I make my Python script faster?

python performance

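I am very new to Python, and I have written a (probably very ugly) script that is supposed to randomly select a subset of sequences from a fastq file. A fastq file stores its information in blocks of four lines each. The first line of each block starts with the character "@". The fastq file I use as input is 36 GB and contains about 14,000,000 lines.

I tried to rewrite an existing script that used far too much memory, and I managed to reduce the memory usage a lot. But the script takes a very long time to run, and I don't understand why.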

import argparse
import random
import subprocess
import sys

parser = argparse.ArgumentParser()
parser.add_argument("infile", type = str, help = "The name of the fastq input file.", default = sys.stdin)
parser.add_argument("outputfile", type = str, help = "Name of the output file.")
parser.add_argument("-n", help="Number of sequences to sample", default=1)
args = parser.parse_args()


def sample():
    linesamples = []
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # make a list of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.append(int(4*i+0))
        linesamples.append(int(4*i+1))
        linesamples.append(int(4*i+2))
        linesamples.append(int(4*i+3))
    # fetch lines from input file and write them to output file.
    for i, line in enumerate(infile):
        if i in linesamples:
            outputfile.write(line)

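The grep step takes almost no time, but after more than 500 minutes the script still has not started writing to the output file. So I assume that one of the steps between grep and the last for loop is the one that takes so long. But I don't know exactly which step it is, or what I can do to speed it up.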

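Try to parallelize your code. Here is what I mean. You have 14,000,000 lines of input.

1. First use grep to filter the lines, and write them to filteredInput.txt.

2. Split filteredInput into files of 10,000 to 100,000 lines each, such as filteredInput001.txt, filteredInput002.txt.

3. Run your code on these split files. Write the output to separate files, such as output001.txt, output002.txt.

4. Merge the results as the last step.

Since your code does not work at all right now, you can also run it on these filtered inputs. Your code would check whether the filteredInput file exists, work out which step it is at, and continue from that step.

You can also run several Python processes this way (after step 1), using the shell or Python threads.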
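The splitting step above could be sketched roughly as follows. This is only an illustration under assumptions: the function name split_fastq and its parameters are hypothetical, the chunk size is counted in records so that no 4-line fastq block is cut in half, and the output naming merely imitates the filteredInput001.txt pattern mentioned above.

```python
import itertools


def split_fastq(in_path, out_prefix, records_per_chunk):
    """Split a fastq file into numbered chunk files and return their paths.

    Hypothetical helper: each chunk holds a whole number of 4-line records.
    """
    lines_per_chunk = 4 * records_per_chunk
    paths = []
    with open(in_path, "r") as src:
        for index in itertools.count(1):
            # Read the next block of lines without loading the whole file.
            chunk = list(itertools.islice(src, lines_per_chunk))
            if not chunk:
                break
            path = "%s%03d.txt" % (out_prefix, index)
            with open(path, "w") as dst:
                dst.writelines(chunk)
            paths.append(path)
    return paths
```

Each resulting file could then be handed to a separate process. Whether that pays off depends on whether the work is CPU-bound at all, which for plain line copying is doubtful.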

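Depending on the size of linesamples, the test "if i in linesamples" will take a long time, because you are searching through a list on every iteration over infile. You can convert it to a set to shorten the lookup time. Also, enumerate is not very efficient; I have replaced it with a line_num construct that we increment on each iteration.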

def sample():
    linesamples = set()
    infile = open(args.infile, 'r')
    outputfile = open(args.outputfile, 'w')
    # count the number of fastq "chunks" in the input file:
    seqs = subprocess.check_output(["grep", "-c", "@", str(args.infile)])
    # randomly select n fastq "chunks":
    seqsamples = random.sample(xrange(0,int(seqs)), int(args.n))
    # make a set of the lines that are to be fetched from the fastq file:
    for i in seqsamples:
        linesamples.add(4*i+0)
        linesamples.add(4*i+1)
        linesamples.add(4*i+2)
        linesamples.add(4*i+3)
    # fetch lines from input file and write them to output file.
    line_num = 0
    for line in infile:
        if line_num in linesamples:
            outputfile.write(line)
        line_num += 1
    infile.close()
    outputfile.close()
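To see why the container type matters, here is a small timing sketch, written for Python 3 (the code above uses Python 2's xrange). The helper names build_line_numbers and time_membership are hypothetical, and the sizes in the usage part are arbitrary. A list is scanned element by element on every "in" test, while a set uses a hash lookup.

```python
import timeit


def build_line_numbers(record_indices):
    """Return the set of line numbers covered by the given 4-line records."""
    line_numbers = set()
    for i in record_indices:
        line_numbers.update(range(4 * i, 4 * i + 4))
    return line_numbers


def time_membership(container, probes, repeat=3):
    """Best-of-N wall time for testing every probe against the container."""
    return min(timeit.repeat(lambda: [p in container for p in probes],
                             number=1, repeat=repeat))


if __name__ == "__main__":
    as_set = build_line_numbers(range(0, 20000, 10))
    as_list = sorted(as_set)
    probes = range(0, 80000, 40)
    print("list:", time_membership(as_list, probes))
    print("set: ", time_membership(as_set, probes))
```

With millions of input lines and a long list of sampled line numbers, the list version does on the order of (lines x samples) comparisons, which fits the run time described in the question.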

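You said that grep runs quite fast, so in that case, instead of only using grep to count the occurrences of @, have grep output the byte offset of each @ character it sees (using grep's -b option). Then use random.sample to pick the chunks you want. Once you have chosen the byte offsets you want, use infile.seek to go to each byte offset and print out 4 lines from there.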
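A minimal sketch of the seek-based idea, for Python 3. Note the swap: instead of calling grep -b, the hypothetical helper record_offsets collects the offsets in pure Python by noting the start of every fourth line. This is an assumption made for the sake of a self-contained example; it also avoids counting quality lines that happen to contain "@". All function and parameter names are illustrative.

```python
import random


def record_offsets(path):
    """Return the byte offset of the first line of every 4-line record."""
    offsets = []
    pos = 0
    with open(path, "rb") as handle:
        for line_num, line in enumerate(handle):
            if line_num % 4 == 0:
                offsets.append(pos)
            pos += len(line)  # binary mode, so len() is the byte count
    return offsets


def sample_by_seek(in_path, out_path, n, seed=None):
    """Write n randomly chosen records to out_path; return how many."""
    rng = random.Random(seed)
    chosen = sorted(rng.sample(record_offsets(in_path), n))
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for offset in chosen:
            src.seek(offset)
            for _ in range(4):  # copy the four lines of this record
                dst.write(src.readline())
    return len(chosen)
```

Sorting the chosen offsets keeps the reads moving forward through the file, and the output keeps the input order of the records.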

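You can use the reservoir sampling algorithm. With this algorithm you only need to read through the data once (there is no need to count the lines of the file in advance), so you can pipe data through your script. There is example Python code on the Wikipedia page.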

There is also a C implementation for FASTQ sampling by Heng Li.
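A minimal Python 3 sketch of single-pass sampling, assuming the algorithm meant above is reservoir sampling in its simplest form (often called Algorithm R). It is neither the Wikipedia code nor Heng Li's implementation; the names read_records and reservoir_sample are hypothetical.

```python
import random
from itertools import islice


def read_records(lines):
    """Yield fastq records as tuples of four consecutive lines."""
    it = iter(lines)
    while True:
        record = tuple(islice(it, 4))
        if len(record) < 4:  # end of input (or a truncated last record)
            return
        yield record


def reservoir_sample(records, k, rng=random):
    """Return k records chosen uniformly at random in a single pass."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record  # replace with probability k/(i+1)
    return reservoir
```

Because only k records are held in memory and the input is consumed once, something like reservoir_sample(read_records(sys.stdin), 1000) would work on piped input. The sampled records come back in reservoir order, not in file order.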

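You should profile your program to see which steps are hanging. Have you also tried running your code on a smaller file to see whether it runs to completion? Another step I would consider when optimizing is using threads and multiprocessing to distribute the work.

Don't call int all the time inside the loop. Also, use the with statement.

Suggesting parallelization before optimizing the algorithm is probably not a good idea. With the right algorithm, IO will be the bottleneck, not the CPU.

@cel His code does not even work right now, but splitting the problem and parallelizing is not a good idea.

I think this answer correctly identifies the bottleneck, "if i in linesamples". However, explicitly closing the file handles would probably be a good idea :)

enumerate is inefficient? Do you have any benchmarks to prove that?

I agree with @Matthias; my timings show that enumerate is faster on both CPython 2.7 and 3.4. In any case, the overhead is negligible compared with the actual cost.

-1 for spreading misinformation. enumerate is faster. All you did was replace some optimized C code that enumerate handles with a few Python bytecodes. Moreover, it should be preferred for its intended semantics alone. A generic counter can do many things, but the return value of enumerate has a specific meaning.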
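The disagreement about enumerate is easy to check locally. The following Python 3 sketch times both loop styles on in-memory data; the function names and sizes are made up for illustration, and the numbers will vary by machine and interpreter version.

```python
import timeit


def count_with_enumerate(lines, wanted):
    """Count wanted line numbers using enumerate."""
    hits = 0
    for i, _line in enumerate(lines):
        if i in wanted:
            hits += 1
    return hits


def count_with_counter(lines, wanted):
    """Count wanted line numbers using a manually incremented counter."""
    hits = 0
    line_num = 0
    for _line in lines:
        if line_num in wanted:
            hits += 1
        line_num += 1
    return hits


if __name__ == "__main__":
    lines = ["x\n"] * 200000
    wanted = set(range(0, 200000, 1000))
    for func in (count_with_enumerate, count_with_counter):
        best = min(timeit.repeat(lambda: func(lines, wanted),
                                 number=1, repeat=5))
        print(func.__name__, best)
```

Both functions do the same work, so any difference in the timings comes from the counting mechanism alone.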