How to perform an efficient merge in Python


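I have about 50 large datasets, each with roughly 200K-500K columns, and I am trying to come up with an efficient way to merge/concatenate them. What is the fastest way to do a conditional column-wise concatenation (merge) of these files?

Currently I have code that works, shown below, but it takes hours (at least 12) to get through the datasets. Keeping in mind that these input files (datasets) will be very large, is there a way to adjust this code so that it uses as little memory as possible? One clue I have (from looking at the code below) is to close the files after opening them, but I don't know how to do that.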

Note that:
a.  All files have the same number of rows
b.  The first two columns are the same throughout the files
c.  All files are tab delimited
d.  This code works but it is ridiculously slow!

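My code below works on the sample datasets. Like my large datasets, the sample datasets below share the same first two columns. I would appreciate any feedback or suggestions on how to make the code run efficiently, or on other ways to get the job done efficiently.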

Input 1: test_c1_k2_txt.gz :-
c1  c2  1.8 1.9 1.7
L1  P   0.5 1.4 1.1
L2  P   0.4 1.8 1.2
L3  P   0.1 1.9 1.3

Input 2: test_c1_k4_txt.gz :-
c1  c2  0.1 0.9 1.1 1.2
L1  P   1.8 1.7 1.8 2.8
L2  P   1.3 1.4 1.2 1.1
L3  P   1.7 1.6 1.5 1.4

Input 3: test_c3_k1_txt.gz :-
c1  c2  1.3 1.4
L1  P   1.1 2.9
L2  P   2.2 1.4
L3  P   1.7 1.6

Output : - test_all_c_all_k_concatenated.txt.gz :-
c1  c2  1.8 1.9 1.7 0.1 0.9 1.1 1.2 1.3 1.4
L1  P   0.5 1.4 1.1 1.8 1.7 1.8 2.8 1.1 2.9
L2  P   0.4 1.8 1.2 1.3 1.4 1.2 1.1 2.2 1.4
L3  P   0.1 1.9 1.3 1.7 1.6 1.5 1.4 1.7 1.6

Python code for merging/concatenating

import os,glob,sys,gzip,time


start_time=time.time()

max_c=3
max_k=4

filearr=[]

# Loop through the files, in the order of "c" first and then in the order of "k", and create a file array
# (the ranges are inclusive of max_c and max_k so that c3 and k4 files are picked up)
for c in range(1,max_c+1):
    for k in range(1,max_k+1):
        # Set my string of file name
        fname= "test_c"+str(c)+"_k"+str(k)+"_txt.gz"
        # If the file name specified exists, ..
        if os.path.exists(fname):
            print("Input file "+ fname+ " exists ... ")
            # Open files in text mode and create a list array
            files=[gzip.open(f,'rt') for f in glob.glob(fname)]
            filearr=filearr+files

# Initialize a list array to append columns to
d=[]
for file in filearr:
    # row strip each line for each file
    row_list=[line.rstrip().split('\t') for line in file.readlines()]
    # Transpose the list array to make columns for each file
    row_list_t=[[r[col] for r in row_list] for col in range(len(row_list[0]))]
    # Combine the transposed rows from each file into one file
    d=d+row_list_t

# Initialize an empty array
temp=[]
for i in (d):
    # Append new columns each time (this skips the repeated first two columns)
    if i not in temp:
        temp.append(i)
# Transpose back so that each entry is one output row
appended=[[r[col] for r in temp] for col in range(len(temp[0]))]

# Write output dataset into a tab delimited file
outfile=gzip.open('all_c_all_k_concatenated.txt.gz','wt')
for i in appended:
    for j in i[:-1]:
        outfile.write(j+'\t')
    outfile.write(i[-1]+'\n')
outfile.close()
print('executed prob file concatenation successfully. ')

total_time=time.time() - start_time
print("Total time it took to finish: %s" % total_time)

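Your code is hard to read; however, I can see two O(N^2) operations being performed here.

The first is executing d=d+row_list_t inside a loop. That operation creates a new list every time, so it is O(N) on its own, which makes the loop O(N^2). Switch to an in-place method such as append (or extend, which matches the concatenation done here) to improve this.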
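A minimal sketch of the difference, written for Python 3 and using made-up column data. The names `chunks`, `combined_slow`, and `combined_fast` are illustrative and do not come from the question. Because `d` in the question is a flat list of columns, `extend` (or `+=`) is the in-place equivalent of the concatenation, whereas `append` would add each file's column list as a single nested element:

```python
# Three hypothetical per-file column lists, standing in for row_list_t.
chunks = [[["a", "b"], ["c", "d"]], [["e", "f"]], [["g", "h"]]]

# Quadratic pattern: each iteration builds a new list and copies
# everything collected so far.
combined_slow = []
for chunk in chunks:
    combined_slow = combined_slow + chunk

# Linear pattern: extend grows the existing list in place.
combined_fast = []
for chunk in chunks:
    combined_fast.extend(chunk)

# Both approaches produce the same flat list of columns.
assert combined_slow == combined_fast
```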

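The second is executing if i not in temp: inside a loop. Searching a list is O(N), which makes your loop O(N^2). Add a set for the existence check to fix this. (The extra O(N) memory it needs is nothing compared with what is already being used, and it is well worth the speedup.)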
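One way this could look, as a sketch in Python 3 rather than a drop-in replacement. The helper name `unique_columns` and the sample data are hypothetical. Lists cannot be stored in a set, so the sketch assumes each column can be converted to a tuple to serve as the set key:

```python
def unique_columns(columns):
    """Return columns in first-seen order, dropping exact duplicates."""
    seen = set()
    unique = []
    for column in columns:
        key = tuple(column)  # lists are unhashable, so use a tuple as the key
        if key not in seen:  # average O(1) lookup instead of an O(N) list scan
            seen.add(key)
            unique.append(column)
    return unique

# Hypothetical columns from two files that share the same two key columns.
columns = [["c1", "L1"], ["c2", "P"], ["1.8", "0.5"],
           ["c1", "L1"], ["c2", "P"], ["1.3", "1.1"]]
deduped = unique_columns(columns)
```

Note that, like the original check, this drops any column that exactly repeats an earlier one, not only the two shared key columns.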

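However, this will not solve all of your problems; there are probably more. The best approach is to import time at the start of the program and then print time.time() before each part of the program. That will show you which parts run slower than others, so you can try to work out how to fix them.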
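A small sketch of that kind of instrumentation in Python 3. The helper `timed` and the stand-in `transpose` step are hypothetical names chosen for illustration; `time.perf_counter` is used here instead of `time.time` because it is intended for measuring intervals:

```python
import time

def timed(label, func, *args):
    """Run func(*args), print the elapsed wall-clock time, return the result."""
    start = time.perf_counter()
    result = func(*args)
    elapsed = time.perf_counter() - start
    print("%s took %.3f s" % (label, elapsed))
    return result

def transpose(rows):
    """Stand-in for one step of the program: turn rows into columns."""
    return [list(col) for col in zip(*rows)]

rows = [["L1", "P", "0.5"], ["L2", "P", "0.4"]]
columns = timed("transpose", transpose, rows)
```

Wrapping each stage (reading, transposing, deduplicating, writing) this way would give one timing line per stage.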

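The following code is an efficient way to handle the data merge. It opens all of the files. It then copies each line from the first data file as-is: that is the two header columns plus all of that file's values. Next, for each of the other input files, it reads one line, strips the two header columns, and writes the rest to the output dataset. The values from each input file are kept separate from those of the others.

Have fun!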

#!/usr/bin/env python

import glob, gzip, re

# open every input in text mode so that lines are str rather than bytes
data_files = [ gzip.open(name, 'rt') for name in sorted(
    glob.glob('*_txt.gz')
) ]

# we'll use the two header columns from the first file
firstf = data_files.pop(0)

outf = gzip.open('all_c_all_k_concatenated.txt.gz', 'wt')
for recnum,fline in enumerate( firstf ):

    print('record', recnum+1)

    # output header columns plus first batch of data
    outf.write( fline.rstrip() )

    # separate first file's values from others
    outf.write( ' ' )

    # for each input, read one line of data, write values
    for dataf in data_files:
        # read line with headers and values
        line = next(dataf)

        # zap two header columns
        line = re.sub(r'^\S+\s+\S+\s+', '', line)

        outf.write( line.rstrip() )

        # separate this file's values from next
        outf.write( ' ' )

    # finish the line of data
    outf.write( '\n' )

outf.close()
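The same line-at-a-time idea can also be sketched with explicit file closing, which the question asks about. This is a Python 3 sketch under a few assumptions: the function name `merge_columns` and the parameter `key_columns` are hypothetical, the caller passes the input names already in the desired order, all files have the same number of rows, and fields are tab-delimited as the question states.

```python
import gzip
from contextlib import ExitStack

def merge_columns(input_names, output_name, key_columns=2):
    """Stream gzipped tab-delimited files side by side into one output file."""
    with ExitStack() as stack:
        # ExitStack closes every registered file when the block exits,
        # even if an error occurs part-way through.
        inputs = [stack.enter_context(gzip.open(name, "rt"))
                  for name in input_names]
        out = stack.enter_context(gzip.open(output_name, "wt"))
        # zip pulls one line from each file per step, so only one row
        # per input file is held in memory at a time.
        for lines in zip(*inputs):
            fields = [line.rstrip("\n").split("\t") for line in lines]
            merged = fields[0]  # key columns come from the first file only
            for other in fields[1:]:
                merged.extend(other[key_columns:])
            out.write("\t".join(merged) + "\n")
```

With the sample files from the question, a call might look like `merge_columns(["test_c1_k2_txt.gz", "test_c1_k4_txt.gz", "test_c3_k1_txt.gz"], "all_c_all_k_concatenated.txt.gz")`. Because `zip` stops at the shortest input, files with differing row counts would be silently truncated rather than reported.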

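You need to find the bottleneck and figure out which parts are taking a long time, e.g. put timers around each step (or better, use cProfile). You could even turn each step into a function (this won't give any speedup; it just makes the code more readable).

I didn't downvote, but whoever did probably did so because this question arguably belongs somewhere other than Stack Overflow. I'd bet on the if i not in temp: check; temp is a list, so a membership test is O(N) in the worst case, which means the whole loop will be O(N^2).

OP should also use meaningful variable names; the current ones tell us nothing. i, j are typically indices, but you are using them as values.

Sorry, I didn't know this post shouldn't belong on Stack Overflow. Thanks, these are useful comments. I timed how long each block (each step) takes to run, and the for loop that transposes each matrix seems to take the longest.

This really helps: it costs less memory and is more efficient, but it still takes as long as the code above. Here is your revised version. I changed it because, first, it did not sort the files correctly (I want the files ordered by c, then by k), so I wrote another for loop to build the file array. Second, the output was not formatted correctly, so I inserted tabs instead of spaces when writing the file: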