Python: a faster way to build an expected-frequency matrix from a sparse count matrix

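I have a compressed sparse row matrix containing counts. I want to build a matrix containing the expected frequencies for these counts. Here is the code I am currently using: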

from scipy.sparse import coo_matrix

#m is a csr_matrix

col_total = m.sum(axis=0)
row_total = m.sum(axis=1)
n = int(col_total.sum(axis=1))
A = coo_matrix(m)

for i,j in zip(A.row,A.col):
    m[i,j]= col_total.item(j)*row_total.item(i)/n

This works fine on small matrices. On larger matrices (>1 GB), the for loop takes days to run. Is there any way to make this faster?

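m.data = (col_total[:,A.col].A*(row_total[A.row,:].T.A)/n)[0]

is a fully vectorized way of computing m.data. It can probably be cleaned up. col_total is a matrix, so performing element-wise multiplication requires some extra syntax.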

I'll demonstrate:

In [37]: m=sparse.rand(10,10,.1,'csr')
In [38]: col_total=m.sum(axis=0)
In [39]: row_total=m.sum(axis=1)
In [40]: n=int(col_total.sum(axis=1))

In [42]: A=m.tocoo()

In [46]: for i,j in zip(A.row,A.col):
   ....:         m[i,j]= col_total.item(j)*row_total.item(i)/n
   ....:     

In [49]: m.data
Out[49]: 
array([ 0.39490171,  0.64246488,  0.19310878,  0.13847277,  0.2018023 ,
        0.008504  ,  0.04387622,  0.10903026,  0.37976005,  0.11414632])

In [51]: col_total[:,A.col].A*(row_total[A.row,:].T.A)/n
Out[51]: 
array([[ 0.39490171,  0.64246488,  0.19310878,  0.13847277,  0.2018023 ,
         0.008504  ,  0.04387622,  0.10903026,  0.37976005,  0.11414632]])

In [53]: (col_total[:,A.col].A*(row_total[A.row,:].T.A)/n)[0]
Out[53]: 
array([ 0.39490171,  0.64246488,  0.19310878,  0.13847277,  0.2018023 ,
        0.008504  ,  0.04387622,  0.10903026,  0.37976005,  0.11414632])
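One possible cleanup of the one-liner above, as a minimal sketch: flatten the row and column totals to 1-D NumPy arrays first, so that `*` is plain element-wise multiplication and no `.A` or `.T` conversions are needed. The example matrix is made up for illustration, and `n` is left as a float rather than cast to `int`. Like the one-liner, it relies on `tocoo()` returning the entries of a CSR matrix in the same order as `m.data`.

```python
import numpy as np
from scipy import sparse

# Hypothetical small count matrix, used only for illustration.
m = sparse.csr_matrix(np.array([[1.0, 0.0, 2.0],
                                [0.0, 3.0, 0.0]]))

# Flatten the totals to 1-D arrays so that `*` is element-wise.
col = np.asarray(m.sum(axis=0)).ravel()   # column totals
row = np.asarray(m.sum(axis=1)).ravel()   # row totals
n = col.sum()                             # grand total

A = m.tocoo()                             # row/column index of each stored entry

out = m.copy()                            # keep the original counts untouched
# Assumes A.row / A.col are in the same order as out.data (CSR -> COO).
out.data = row[A.row] * col[A.col] / n

print(out.toarray())
```

Because only the `data` array is replaced, `out` keeps the CSR format and sparsity pattern of `m`.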

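To expand a little on @hpaulj's answer, you can get rid of the for loop by creating the output matrix directly from the expected frequencies and the row/column indices of the non-zero elements in m: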

from scipy import sparse
import numpy as np

def fast_efreqs(m):

    col_total = np.array(m.sum(axis=0)).ravel()
    row_total = np.array(m.sum(axis=1)).ravel()

    # I'm casting this to an int for consistency with your version, but it's
    # not clear to me why you would want to do this...
    grand_total = int(col_total.sum())

    ridx, cidx = m.nonzero()            # indices of non-zero elements in m
    efreqs = row_total[ridx] * col_total[cidx] / grand_total

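    # Pass the shape explicitly; otherwise it is inferred from the largest
    # indices and trailing all-zero rows/columns of m would be dropped.
    return sparse.coo_matrix((efreqs, (ridx, cidx)), shape=m.shape)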
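
A caveat when building a COO matrix from index arrays, shown as a small sketch with made-up data: if no `shape` is given, `coo_matrix` infers the dimensions from the largest row and column index, so a count matrix whose last rows or columns are entirely zero would come back smaller than the input.

```python
import numpy as np
from scipy import sparse

# Hypothetical 3x3 count matrix whose last row and last column are all zero.
m = sparse.csr_matrix(np.array([[1.0, 2.0, 0.0],
                                [0.0, 3.0, 0.0],
                                [0.0, 0.0, 0.0]]))

ridx, cidx = m.nonzero()
vals = np.ones(len(ridx))   # placeholder values; only the shape matters here

inferred = sparse.coo_matrix((vals, (ridx, cidx)))
explicit = sparse.coo_matrix((vals, (ridx, cidx)), shape=m.shape)

print(inferred.shape)   # shape inferred from the largest indices
print(explicit.shape)   # shape taken from m
```

Passing `shape=m.shape` keeps the result the same size as the input, which also matters when the result is compared element-wise against another matrix.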

For comparison, here is the original code as a function:

def orig_efreqs(m):

    col_total = m.sum(axis=0)
    row_total = m.sum(axis=1)
    n = int(col_total.sum(axis=1))

    A = sparse.coo_matrix(m)
    for i,j in zip(A.row,A.col):
        m[i,j]= col_total.item(j)*row_total.item(i)/n

    return m

Testing equivalence on a small matrix:

m = sparse.rand(100, 100, density=0.1, format='csr')
print((orig_efreqs(m.copy()) != fast_efreqs(m)).nnz == 0)
# True

Benchmarking performance on a larger matrix:

In [1]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr')
   .....: orig_efreqs(m)
   .....: 
1 loops, best of 3: 2min 25s per loop

In [2]: %%timeit m = sparse.rand(10000, 10000, density=0.01, format='csr')
   .....: fast_efreqs(m)
   .....: 
10 loops, best of 3: 38.3 ms per loop

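Should row_total.item(i) be in the denominator? I don't see how this computation is supposed to produce expected frequencies.

I still don't see what this result is supposed to represent. If you feed it [[1, 0], [0, 1]], you get [[0.5, 0], [0, 0.5]]. But if you feed it [[1, eps], [eps, 1]], where eps is a tiny but non-zero number, you get approximately [[0.5, 0.5], [0.5, 0.5]]. Should we be worried that some inputs lead to expected frequencies greater than 1?

@user2357112 The code does what I want it to do. My only concern is performance.

@kormak: If it does what you want it to do, then what you want it to do is very, very, very strange. We can suggest ways to make it run faster, but for best results we need to understand it.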
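
The two inputs mentioned in the comments can be checked with a short sketch. The helper name `expected_on_nonzeros` is hypothetical; it applies the same formula as the answers (row total times column total divided by the grand total), filled in only at the non-zero positions of the input, and it does not cast the grand total to `int`.

```python
import numpy as np
from scipy import sparse

def expected_on_nonzeros(m):
    """Row total * column total / grand total, at the non-zero entries of m only."""
    row = np.asarray(m.sum(axis=1)).ravel()
    col = np.asarray(m.sum(axis=0)).ravel()
    n = col.sum()
    ridx, cidx = m.nonzero()
    vals = row[ridx] * col[cidx] / n
    return sparse.coo_matrix((vals, (ridx, cidx)), shape=m.shape).toarray()

# First input from the comments: only the diagonal is stored,
# so only the diagonal receives a value.
identity = sparse.csr_matrix(np.eye(2))
res_identity = expected_on_nonzeros(identity)
print(res_identity)

# Second input: tiny but non-zero off-diagonal counts, so all four
# positions are stored and all four receive a value close to 0.5.
eps = 1e-9
perturbed = sparse.csr_matrix(np.array([[1.0, eps], [eps, 1.0]]))
res_perturbed = expected_on_nonzeros(perturbed)
print(res_perturbed)
```

The result changes discontinuously as an entry goes from zero to barely non-zero, because the output reuses the sparsity pattern of the input instead of filling every cell.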