Python PySpark matrix accumulator (tags: python, sparse-matrix, pyspark)


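I want to populate a matrix with values inferred from an rdd, using a PySpark accumulator; I found the documentation a bit unclear. Adding a bit of background in case it is relevant.

My rddData contains lists of indices for which one count must be added to the matrix. For example, this list maps to the indices: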

[1,3,4] -> (1,1), (1,3), (1,4), (3,3), (3,4), (4,4)
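For reference, the mapping in the example above can be reproduced locally with the standard library. This is only a sketch: the helper name index_pairs is hypothetical, and it assumes that only the pairs with i <= j (the upper triangle plus the diagonal) are wanted, as the example lists them. A nested loop over the same list twice visits every ordered pair instead, so it would also count (3,1), (4,1) and (4,3).

```python
from itertools import combinations_with_replacement


def index_pairs(indices):
    """Return each pair (i, j) with i <= j, including the diagonal.

    Hypothetical helper, not part of the original post.
    """
    return list(combinations_with_replacement(sorted(indices), 2))


print(index_pairs([1, 3, 4]))
# [(1, 1), (1, 3), (1, 4), (3, 3), (3, 4), (4, 4)]
```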

Now, here is my accumulator:

from pyspark.accumulators import AccumulatorParam
class MatrixAccumulatorParam(AccumulatorParam):
    def zero(self, mInitial):
        import numpy as np
        aaZeros = np.zeros(mInitial.shape)
        return aaZeros

    def addInPlace(self, mAdd, lIndex):
        mAdd[lIndex[0], lIndex[1]] += 1
        return mAdd

And this is my mapper function:

def populate_sparse(lIndices):
    for i1 in lIndices:
        for i2 in lIndices:
            oAccumilatorMatrix.add([i1, i2])

Then I run it over the data:

oAccumilatorMatrix = oSc.accumulator(aaZeros, MatrixAccumulatorParam())

rddData.map(populate_sparse).collect()

Now, when I look at my data:

sum(sum(oAccumilatorMatrix.value))
#= 0.0

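It should not be zero. What am I missing?

EDIT: I first tried this with a sparse matrix and got a traceback saying that sparse matrices are not supported. I have changed the question to use a dense numpy matrix: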

...

    raise IndexError("Indexing with sparse matrices is not supported"
IndexError: Indexing with sparse matrices is not supported except boolean indexing where matrix and index are equal shapes.

Aha! I think I've got it. At the end of the day, the accumulator still needs to add its own pieces to itself. So change addInPlace to:

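def addInPlace(self, mAdd, lIndex):
    if isinstance(lIndex, list):
        # An [i, j] index pair added inside a task: bump that single cell.
        mAdd[lIndex[0], lIndex[1]] += 1
    else:
        # A partial matrix from another copy of the accumulator: merge it in.
        mAdd += lIndex
    return mAdd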

So now it adds to an index when it is given a list, and adds itself after the populate_sparse function loop to create the final matrix.
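To see why both branches are needed, the two roles of addInPlace can be simulated locally without Spark. This is a rough sketch under stated assumptions, not how PySpark is implemented: add_in_place and run_task are hypothetical names, a collections.Counter keyed by (i, j) stands in for the numpy matrix so the sketch has no dependencies, and the "tasks" are plain function calls. The idea it illustrates is that each task accumulates into its own zero copy, and those partial results are then merged through the same function.

```python
from collections import Counter


def add_in_place(acc, term):
    """Mimic the revised addInPlace: accept an index pair or a partial result."""
    if isinstance(term, list):
        # An [i, j] index pair added inside a task: bump that single cell.
        acc[(term[0], term[1])] += 1
    else:
        # A partial result from another task: merge it cell by cell.
        acc.update(term)
    return acc


def run_task(index_lists):
    """Hypothetical stand-in for one task working on its own zero copy."""
    local = Counter()
    for indices in index_lists:
        for i1 in indices:
            for i2 in indices:
                add_in_place(local, [i1, i2])
    return local


# Two simulated partitions of rddData.
partials = [run_task([[1, 3, 4]]), run_task([[3, 4], [1]])]

# Simulated driver side: fold every partial result into the final value.
total = Counter()
for partial in partials:
    add_in_place(total, partial)

print(sum(total.values()))  # 14
print(total[(3, 4)])        # 2
```

Without the else branch, the merge step would try to treat a whole partial result as an index pair, so the per-task counts would never be folded into the final value.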

You are a genius. I have been banging my head against this for hours!!