Python SciPy: sparse indicator matrix from array(s)

Tags: python, numpy, scipy, sparse-matrix, indicator

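What is the most efficient way to compute a sparse boolean matrix I from one or two arrays a, b, such that I[i,j]==True where a[i]==b[j]? The following is fast but memory-inefficient:

I = a[:,None]==b

The following is slow and still memory-inefficient during creation:

I = csr((a[:,None]==b),shape=(len(a),len(b)))

The following at least yields the rows and cols for a better csr_matrix initialization, but it still creates the full dense matrix and is just as slow:

z = np.argwhere((a[:,None]==b))

Any ideas?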
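To put a number on the memory problem described in the question, here is a minimal sketch comparing the storage of the dense comparison with the storage of the resulting CSR matrix. The array sizes and the value range are arbitrary assumptions chosen for illustration, not figures from the question.

```python
import numpy as np
from scipy import sparse

# Hypothetical sizes and value range, chosen only for illustration
rng = np.random.default_rng(0)
a = rng.integers(0, 1000, size=2000)
b = rng.integers(0, 1000, size=3000)

# Broadcasting materializes the full (len(a), len(b)) boolean array
dense = a[:, None] == b
I = sparse.csr_matrix(dense)

# A boolean array uses one byte per entry, so this is len(a) * len(b)
dense_bytes = dense.nbytes
# CSR stores only the nonzero entries plus two index arrays
sparse_bytes = I.data.nbytes + I.indices.nbytes + I.indptr.nbytes

print(dense_bytes, sparse_bytes, I.nnz)
```

The final matrix is small, but the dense intermediate still has to fit in memory first, which is the cost the question wants to avoid.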

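You can use np.isclose with a small tolerance:

np.isclose(a,b)

Or:

Note: this returns an array of True/False values.

One approach is to first identify, using sets, all distinct elements that a and b have in common. This should work well as long as a and b do not contain too many distinct values. Then simply loop over the distinct values (the variable values below) and use np.argwhere to find the indices in a and b where each value occurs. The 2D indices of the sparse matrix can then be built with np.repeat and np.tile: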
import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

##identifying all values that occur both in a and b:
values = set(np.unique(a)) & set(np.unique(b))

##here we collect the indices in a and b where the respective values are the same:
rows, cols = [], []

##looping over the common values, finding their indices in a and b, and
##generating the 2D indices of the sparse matrix with np.repeat and np.tile
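for value in values:
    x = np.argwhere(a==value).ravel()
    y = np.argwhere(b==value).ravel()
    ##every index in x must be paired with every index in y, so x is
    ##repeated len(y) times and y is tiled len(x) times
    rows.append(np.repeat(x, len(y)))
    cols.append(np.tile(y, len(x)))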

##concatenating the indices for different values and generating a 1D vector
##of True values for final matrix generation
rows = np.hstack(rows)
cols = np.hstack(cols)
data = np.ones(len(rows),dtype=bool)

##generating sparse matrix
I3 = sparse.csr_matrix( (data,(rows,cols)), shape=(len(a),len(b)) )

##checking that the matrix was generated correctly:
print((I1 != I3).nnz==0)

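The csr matrix is built with the (data, (rows, cols)) form of the csr_matrix constructor. The test for sparse-matrix equality is taken from https://stackoverflow.com/a/30685839/2454357.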
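The pairing produced by np.repeat and np.tile can be checked on a tiny case. In the sketch below, x and y are hypothetical index arrays (positions of one shared value in a and in b); repeating x once per element of y and tiling y once per element of x yields every (row, col) combination exactly once.

```python
import numpy as np

# Hypothetical positions of one shared value
x = np.array([0, 3])       # indices in a
y = np.array([1, 4, 5])    # indices in b

rows = np.repeat(x, len(y))   # each row index repeated once per column index
cols = np.tile(y, len(x))     # the column indices cycled once per row index

pairs = list(zip(rows.tolist(), cols.tolist()))
print(pairs)
# [(0, 1), (0, 4), (0, 5), (3, 1), (3, 4), (3, 5)]
```

Both arrays have length len(x) * len(y), which is what the (data, (rows, cols)) constructor needs.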
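Old answer:
I don't know about the performance, but at least you can avoid building the full dense matrix by using a simple generator expression. The code below takes two 1d arrays of random integers, first generates the sparse matrix the way the OP posted, and then builds it again from a generator expression that tests each pair of elements for equality: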
import numpy as np
from scipy import sparse

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))

## matrix generation after OP
I1 = sparse.csr_matrix((a[:,None]==b),shape=(len(a),len(b)))

## matrix generation using generator
data, rows, cols = zip(
    *((True, i, j) for i,A in enumerate(a) for j,B in enumerate(b) if A==B)
)
I2 = sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

##testing that matrices are equal
## from https://stackoverflow.com/a/30685839/2454357
print((I1 != I2).nnz==0)  ## --> True

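I don't think there is a way around the double loop (ideally it would be pushed down into numpy), but at least with the generator the loop is optimized to some degree...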
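I guess I could do the whole thing with argsort: detect the indices where the sorted array changes, compute the combinations for each partition size found that way, and unsort again... but I was hoping there would be a simple numpy or scipy function at hand...

Unfortunately this is too slow :(

@radiocontrol How large are your matrices? I tried 15000 x 15000 with 10 distinct values, but of course this depends mainly on the sparsity of the equality matrix, which in turn depends on how the distinct values are distributed.

@radiocontrol Sure, but my suggestion is slow because you have to iterate over all i and all j of the 1d arrays. Sorting, as you suggest, may indeed speed things up, but I don't think there is anything built in...

This seems to be a good solution! I think the first suggestion can be removed. Thanks!

It seems to me that np.isclose(a[:,None], b) also returns a dense array. The same goes for pandas, which additionally brings in more dependencies...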
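The sorting idea raised in the comments can be sketched without any Python-level loop. This is not code from the thread: it swaps the "detect where the sorted array changes" step for np.searchsorted, and the function name indicator_matrix is a hypothetical helper introduced here. It assumes 1d arrays of sortable values and exact equality.

```python
import numpy as np
from scipy import sparse

def indicator_matrix(a, b):
    """Sparse boolean I with I[i, j] True where a[i] == b[j] (hypothetical helper)."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Sort b once; order maps positions in sorted b back to original indices
    order = np.argsort(b, kind="stable")
    b_sorted = b[order]
    # For each a[i], equal values occupy the run [lo[i], hi[i]) in sorted b
    lo = np.searchsorted(b_sorted, a, side="left")
    hi = np.searchsorted(b_sorted, a, side="right")
    counts = hi - lo                                  # matches per row
    starts = np.concatenate(([0], np.cumsum(counts)))
    nnz = int(starts[-1])
    # Row index of every match, then its offset within that row's run
    rows = np.repeat(np.arange(len(a)), counts)
    offsets = np.arange(nnz) - np.repeat(starts[:-1], counts)
    # Position in sorted b, mapped back ("unsorted") to original b indices
    cols = order[np.repeat(lo, counts) + offsets]
    data = np.ones(nnz, dtype=bool)
    return sparse.csr_matrix((data, (rows, cols)), shape=(len(a), len(b)))

a = np.random.randint(0, 10, size=(400,))
b = np.random.randint(0, 10, size=(300,))
I4 = indicator_matrix(a, b)

# Same equality check as used elsewhere in this thread
print((I4 != sparse.csr_matrix(a[:, None] == b)).nnz == 0)
```

Apart from the sort, the temporary arrays are proportional to the number of matches plus len(a) and len(b), so no dense len(a) x len(b) intermediate is created. How it compares in speed to the loop over distinct values will depend on the data.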