Efficient parallel sparse matrix dot product in SciPy (Python)


I have a very large (1.5M x 16M) sparse CSR SciPy matrix A. What I need to compute is the similarity of every pair of rows. I define the similarity as:

Assume a and b are two rows of matrix A
a = (0, 1, 0, 4)
b = (1, 0, 2, 3)
Similarity (a, b) = 0*1 + 1*0 + 0*2 + 4*3 = 12

To compute the similarity of all pairs of rows, I use the following (or cosine similarity):

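Now pairs[i, j] is the similarity of row i and row j, for all i and j. This is very close to the pairwise cosine similarity of the rows, so an efficient parallel algorithm for pairwise cosine similarity would work for me as well.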
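The computation described here amounts to multiplying A by its transpose. A minimal sketch on the two example rows, assuming that is the operation meant; the names AT and pairs are taken from the question's text, and the tiny matrix is only for illustration:

```python
import numpy as np
from scipy import sparse

# The two example rows from the question, stored as a small CSR matrix.
A = sparse.csr_matrix(np.array([[0, 1, 0, 4],
                                [1, 0, 2, 3]]))

AT = A.T.tocsr()     # transpose of A, converted back to CSR
pairs = A.dot(AT)    # pairs[i, j] is the dot product of row i and row j

print(pairs.toarray())
print(pairs[0, 1])   # 12, matching the hand calculation for a and b
```

The diagonal holds each row's dot product with itself, and the result is symmetric, so only one triangle carries new information.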

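The problem: this dot product is very slow because it uses only one CPU (I have access to 64 CPUs on the server).

I could also export A and AT to a file, run any external program that performs the multiplication in parallel, and bring the result back into the Python program.

Is there a more efficient way to compute this dot product? Or a way to compute the pairwise similarities in parallel?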

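I ended up using scikit-learn's 'cosine' distance metric with its pairwise_distances function, which supports sparse matrices and is highly parallelized: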

sklearn.metrics.pairwise.pairwise_distances(X, Y=None, metric='euclidean', n_jobs=1, **kwds)
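As a rough illustration of that call, here is a small sketch on a toy sparse matrix. The matrix and the choice of n_jobs=2 are assumptions made for the example, not values from the answer:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics import pairwise_distances

# Toy stand-in for A: three sparse rows.
A = sparse.csr_matrix(np.array([[0, 1, 0, 4],
                                [1, 0, 2, 3],
                                [2, 0, 0, 1]], dtype=float))

# n_jobs sets how many workers share the computation; -1 would use all cores.
dist = pairwise_distances(A, metric="cosine", n_jobs=2)

print(dist.shape)  # (3, 3): one distance per pair of rows, as a dense array
```

The return value is a dense array, so at the question's scale (1.5M rows) the full result would be far too large to hold in memory and would presumably have to be computed block by block.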

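I could also split A into n horizontal parts (blocks of rows), run the multiplications in parallel with the Parallel Python package, and stack the resulting blocks back together afterward.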
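The row-splitting idea can be sketched roughly as follows. This version swaps in joblib for the Parallel Python package mentioned above, and the helper names blockwise_row_similarity and _block_product are hypothetical; treat it as an outline under those assumptions rather than a tuned implementation:

```python
import numpy as np
from scipy import sparse
from joblib import Parallel, delayed


def _block_product(block, AT):
    # One slice of rows times the full transpose gives the matching
    # rows of the pairwise similarity matrix.
    return block.dot(AT)


def blockwise_row_similarity(A, n_blocks=4, n_jobs=1):
    """Compute A * A.T by splitting A into blocks of rows (hypothetical helper)."""
    A = sparse.csr_matrix(A)
    AT = A.T.tocsr()
    # Row boundaries of the blocks, as evenly spaced as possible.
    bounds = np.linspace(0, A.shape[0], n_blocks + 1, dtype=int)
    blocks = [A[start:stop]
              for start, stop in zip(bounds[:-1], bounds[1:])
              if stop > start]
    # Each block is multiplied independently, so the work can be spread
    # over n_jobs workers.
    results = Parallel(n_jobs=n_jobs)(
        delayed(_block_product)(block, AT) for block in blocks
    )
    # Blocks of rows are stacked on top of each other to rebuild the result.
    return sparse.vstack(results, format="csr")
```

Every worker needs its own copy of AT, so memory use grows with n_jobs; whether this beats a single-process product depends on how sparse the result is.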

I wrote my own implementation using sklearn. It is not parallel, but it is fairly fast for large matrices.

from scipy.sparse import spdiags
from sklearn.preprocessing import normalize

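def get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix):
    # Pairwise row dot products: entry (i, j) is row i . row j.
    sp_matrix = sp_matrix.tocsr()
    matrix = sp_matrix.dot(sp_matrix.T)
    # Zero the diagonal (each row's similarity with itself) by adding its negation.
    diag = spdiags(-matrix.diagonal(), [0], *matrix.shape, format='csr')
    matrix = matrix + diag
    return matrix

def get_similarity_by_cosine(sp_matrix):
    # L2-normalize each row so the dot products become cosine similarities.
    sp_matrix = normalize(sp_matrix.tocsr())
    return get_similarity_by_x_dot_x_greedy_for_memory(sp_matrix)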

Similarity = 1 - cosine distance
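That relationship can be checked numerically. A small sketch, assuming the two example rows from the question as input:

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import pairwise_distances
from sklearn.preprocessing import normalize

X = sparse.csr_matrix(np.array([[0, 1, 0, 4],
                                [1, 0, 2, 3]], dtype=float))

# Cosine similarity as dot products of L2-normalized rows.
Xn = normalize(X)
similarity = Xn.dot(Xn.T).toarray()

# Cosine distance as reported by scikit-learn.
distance = pairwise_distances(X, metric="cosine")

# The two agree up to floating-point rounding.
print(np.allclose(similarity, 1.0 - distance))
```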
