Python: comparing two matrices of different sizes to build one larger matrix - speed improvement?


python performance matrix


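I have two matrices that I need to use to create one larger matrix. Each matrix is just a tab-delimited text file that is read in. Each matrix has 48 columns, with the same column identifiers in both, and a different number of rows. The first matrix is 108887x48 and the second is 55482x48. The entry at each position of each matrix can be 0 or 1, so the data are binary. The final output should have the row IDs of the first matrix as rows and the row IDs of the second matrix as columns, so the final matrix is 108887x55482.

What needs to happen here is: for every pos in every row of the first matrix, and for every row of the second matrix, if the pos (col) is 1 in both matrices, the count in the final matrix goes up by 1. The highest value possible at any position in the final matrix is 48, and many positions are expected to remain 0.

For example: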

mat1
     A B C D
1id1 0 1 0 1
1id2 1 1 0 0
1id3 1 1 1 1
1id4 0 0 1 0

mat2
     A B C D
2id1 1 1 0 0
2id2 0 1 1 0 
2id3 1 1 1 1 
2id4 1 0 1 0

final
     2id1 2id2 2id3 2id4
1id1   1    1    2    0
1id2   2    1    2    1
1id3   2    2    4    2
1id4   0    1    1    1
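
The counting rule in this example can also be written as a formula. The symbols A (first matrix), B (second matrix), and F (final matrix) are introduced here for illustration only and do not appear in the original post:

```latex
F_{ij} = \sum_{c=1}^{48} A_{ic}\, B_{jc}, \qquad A_{ic},\, B_{jc} \in \{0, 1\}, \qquad 0 \le F_{ij} \le 48
```

In the four-column example, the entry for 1id3 and 2id3 is 1·1 + 1·1 + 1·1 + 1·1 = 4, and the entry for 1id1 and 2id4 is 0·1 + 1·0 + 0·1 + 1·0 = 0.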

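I have code that does this, but it is painfully slow, and that is mainly what I am looking for help with. I have already tried to speed up the algorithm as much as I can. It has been running for 24 hours and is only about 25% done. I have let it run to completion before, and the final output file was 20GB. I have no experience with databases, but I could use one here if someone could help me do so, given the code snippet below.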

#!/usr/bin/env python

import sys

mat1in = sys.argv[1]
mat2in = sys.argv[2]

print '\n######################################################################################'
print 'Generating matrix by counts from smaller matrices.'
print '########################################################################################\n'

with open(mat1in, 'r') as f:
        cols = [''] + next(f).strip().split('\t')               # First line of matrix is composed of 48 cols
        mat1 = [line.strip().split('\t') for line in f]         # Each line in matrix = 'ID': 0 or 1 per col id

with open(mat2in, 'r') as f:
        next(f)                                                 # Skip first row, col IDs are taken from mat1
        mat2 = [line.strip().split('\t') for line in f]         # Each line in matrix = 'ID': 0 or 1 per col id

out = open('final_matrix.txt', 'w')                             # Output file

#matrix = []
header = []                                                     # Final matrix header
header.append('')                                               # Add blank as first char in large matrix header
for i in mat2:
        header.append(i[0])                                     # Composed of all mat2 row ids
#matrix.append(header)

print >> out, '\t'.join(header)                                 # First print header to output file

print '\nTotal mat1 rows: ' + str(len(mat1))                    # Get total mat1 rows
print 'Total mat2 rows: ' + str(len(mat2)), '\n'                # Get total mat2 rows
print 'Progress: '                                              # Progress updated as each mat1 id is read

length = len(header)                                            # Length of header, i.e. total number of mat2 ids
totmat1 = len(mat1)                                             # Length of rows (-header), i.e. total number of mat1 ids

total = 0                                                       # Running total - for progress meter
for h in mat1:                                                  # Loop through all mat1 ids - each row in the HC matrix
        row = []                                                # Empty list for new row for large matrix
        row.append(h[0])                                        # Append mat1 id, as first item in each row
        for i in xrange(length-1):                              # For length of large matrix header (add 0 to each row) - header -1 for first '\t'
                row.extend('0')
        for n in xrange(1,49):                                  # Loop through each col id
                for k in mat2:                                  # For every row in mat2
                        if int(h[n]) == 1 and int(k[n]) == 1:   # If the pos (count for that particular col id) is 1 from mat1 and mat2 matrix;
                                pos = header.index(k[0])        # Get the position of the mat2 id
                                row[pos] = str(int(row[pos]) + 1)       # Add 1 to current position in row - [i][j] = [mat1_id][mat2_id]
        print >> out, '\t'.join(row)                            # When row is completed (All columns are compared from both mat1 and mat2 matrices; print final row to large matrix
        total += 1                                              # Update running total
        sys.stdout.write('\r\t' + str(total) + '/' + str(totmat1))  # Print progress to screen (totmat1 = total number of mat1 ids)
        sys.stdout.flush()

print '\n######################################################################################'
print 'Matrix complete.'
print '########################################################################################\n'

Here is the profile for the first 30 iterations over the mat1 ids:

######################################################################################
Generating matrix by counts from smaller matrices.
########################################################################################


Total mat1 rows: 108887
Total mat2 rows: 55482

Progress:
        30/108887^C         2140074 function calls in 101.234 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1   70.176   70.176  101.234  101.234 build_matrix.py:3(<module>)
        4    0.000    0.000    0.000    0.000 {len}
    55514    0.006    0.000    0.006    0.000 {method 'append' of 'list' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
  1719942    1.056    0.000    1.056    0.000 {method 'extend' of 'list' objects}
       30    0.000    0.000    0.000    0.000 {method 'flush' of 'file' objects}
    35776   29.332    0.001   29.332    0.001 {method 'index' of 'list' objects}
       31    0.037    0.001    0.037    0.001 {method 'join' of 'str' objects}
   164370    0.589    0.000    0.589    0.000 {method 'split' of 'str' objects}
   164370    0.033    0.000    0.033    0.000 {method 'strip' of 'str' objects}
       30    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}
        2    0.000    0.000    0.000    0.000 {next}
        3    0.004    0.001    0.004    0.001 {open}
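
In the profile above, list.index accounts for 29.332 of the 101.234 seconds across 35776 calls. A similar report can be collected from inside Python with the standard cProfile and pstats modules. The following is only a minimal sketch, written for Python 3 rather than the Python 2 of the script above; profile_call and slow_lookup are hypothetical names used for illustration, not part of the original script.

```python
import cProfile
import io
import pstats


def profile_call(func, *args):
    """Run func(*args) under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    report = io.StringIO()
    stats = pstats.Stats(profiler, stream=report)
    # Sort by time spent inside each function itself (the tottime column).
    stats.sort_stats('tottime').print_stats(10)
    return result, report.getvalue()


def slow_lookup(ids):
    # Hypothetical stand-in for the hot loop: one list.index search per id.
    return [ids.index(i) for i in ids]


result, report = profile_call(slow_lookup, [str(i) for i in range(2000)])
print(report)
```

Sorting by tottime instead of by name puts the most expensive calls at the top of the report, which makes a hotspot such as list.index easier to spot.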

I don't quite understand what this code does (the single-letter variable names don't help).

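My suggestion: minimize the number of operations performed in the innermost loop. For example, do you need to recompute pos at the innermost level?

pos = header.index(k[0])

If the nested loops over h, n, and k can be reordered, you can reduce the number of calls to the costly list.index, which is an O(n) operation.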
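
One possible way to act on this, sketched below for Python 3: because header is built by walking mat2 in order, the output column for a mat2 row is already known from its position in mat2, so the list.index search can be dropped altogether. This assumes the mat2 ids are unique. The function name build_rows is hypothetical; the row layout (an id followed by '0'/'1' strings) follows the question's code.

```python
def build_rows(mat1, mat2):
    """Yield [mat1_id, count_1, count_2, ...] for each row of mat1.

    mat1 and mat2 are lists of rows obtained by splitting each line on tabs:
    an id followed by '0'/'1' strings.
    """
    # Parse mat2 once, outside every loop. Row j here is output column j,
    # so no search through the header is needed.
    mat2_bits = [[int(v) for v in k[1:]] for k in mat2]
    for h in mat1:
        # Columns where this mat1 row holds a 1; other columns cannot add to any count.
        ones = [n for n, v in enumerate(h[1:]) if int(v) == 1]
        yield [h[0]] + [sum(bits[n] for n in ones) for bits in mat2_bits]
```

This removes the repeated search and the repeated int() conversions, but it still does Python-level work for every output cell, so it illustrates the loop restructuring rather than a fix for the full-size run.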

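This is just a matrix multiplication. You want to multiply the first matrix by the transpose of the second. For efficient matrix operations, get NumPy.

If you read the two input matrices into NumPy arrays of dtype numpy.int8, then the computation is simply

m1.dot(m2.T)

and should take a few minutes at most.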
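
A minimal sketch of how that might look end to end in Python 3, assuming the file layout described in the question (one header line, then an id followed by tab-separated 0/1 values). The helper names read_binary_matrix and write_product and the chunk parameter are hypothetical. Writing the product in blocks of rows is an addition of this sketch, not part of the answer; it avoids holding the full result in memory, which at 108887x55482 int8 entries would be roughly 6 GB.

```python
import io

import numpy as np


def read_binary_matrix(source):
    """Read a tab-delimited 0/1 matrix from an open text file.

    Returns (row_ids, array), where array has dtype int8.
    """
    next(source)  # Skip the header line of column ids.
    ids, rows = [], []
    for line in source:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue  # Ignore blank lines.
        ids.append(fields[0])
        rows.append([int(v) for v in fields[1:]])
    return ids, np.array(rows, dtype=np.int8)


def write_product(ids1, m1, ids2, m2, out, chunk=1000):
    """Write m1.dot(m2.T) to out as tab-delimited text, chunk rows at a time."""
    out.write('\t'.join([''] + ids2) + '\n')
    m2_t = m2.T
    for start in range(0, len(ids1), chunk):
        # Counts never exceed the number of columns (48), so they fit in int8.
        block = m1[start:start + chunk].dot(m2_t)
        for row_id, counts in zip(ids1[start:start + chunk], block):
            out.write(row_id + '\t' + '\t'.join(map(str, counts.tolist())) + '\n')


# Usage with the small example from the question.
mat1_text = "A\tB\tC\tD\n1id1\t0\t1\t0\t1\n1id2\t1\t1\t0\t0\n1id3\t1\t1\t1\t1\n1id4\t0\t0\t1\t0\n"
mat2_text = "A\tB\tC\tD\n2id1\t1\t1\t0\t0\n2id2\t0\t1\t1\t0\n2id3\t1\t1\t1\t1\n2id4\t1\t0\t1\t0\n"
ids1, m1 = read_binary_matrix(io.StringIO(mat1_text))
ids2, m2 = read_binary_matrix(io.StringIO(mat2_text))
result = io.StringIO()
write_product(ids1, m1, ids2, m2, result, chunk=2)
print(result.getvalue())
```

For the real files, the io.StringIO objects would be replaced by files opened with open(...), and the output by a file opened for writing.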

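I can see some inefficiencies in your code, but nothing stands out in particular. The only way to optimize it effectively is to profile it first, and that is something nobody but you can do. Fortunately, it is a fairly simple task; looking at the results will tell you which parts to work on to get the biggest speedup.

Your explanation is very unclear. What does "if the pos (col) is 1 in both matrices" mean? When you say "the count in the final matrix goes up by 1", which count do you mean? Where is that count stored? It sounds as if you just want to multiply the first matrix by the transpose of the second, but it is impossible to be sure.

@user2357112 - I have added a small example of what I want to accomplish (quickly). The pos by col for each matrix is based on the id (n in xrange(1,49)). The row for each mat1 id is initially filled with 0s. Iterating over all mat2 ids (the header of the new matrix), the count goes up by 1 if a 1 is present in mat2 for the same col id. One value per mat2 id is stored in the row for the given mat1 id, and once that id is finished the row is written to the output file. At each iteration I am comparing all mat1 rows by id against each mat2 id by column header.

@user3358205: Should the last row be

0 1

rather than

0 1

? If so, it looks like you just want a matrix multiplication, and you haven't transposed the second matrix. (Also, get NumPy. That guy's matrix multiplication may be better than yours, but neither version comes close to NumPy's.)

See the latest edit, which changes the list index to a dict lookup.

I need the k[0] index to get the position in the header, so that I know which value to update in the row of the large matrix for a given mat1 id (h). For each mat1 id (h) I need to iterate over all mat2 ids (k), looking column by column for positions where the values in mat1 and mat2 are the same. If they are, the [mat1][mat2] position in the combined matrix goes up by 1.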
