Issue with pySpark similarities (Python, PySpark, Cosine Similarity)


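tl;dr How do I use pySpark to compare the similarity of rows?

I have a numpy array in which I would like to compare the similarity of each row to every other row.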

print (pdArray)
#[[ 0.  1.  0. ...,  0.  0.  0.]
# [ 0.  0.  3. ...,  0.  0.  0.]
# [ 0.  0.  0. ...,  0.  0.  7.]
# ..., 
# [ 5.  0.  0. ...,  0.  1.  0.]
# [ 0.  6.  0. ...,  0.  0.  3.]
# [ 0.  0.  0. ...,  2.  0.  0.]]

Using scikit-learn's cosine_similarity, I can compute the cosine similarities as follows:

pyspark.__version__
# '2.2.0'

from sklearn.metrics.pairwise import cosine_similarity
similarities = cosine_similarity(pdArray)

similarities.shape
# (475, 475)

print(similarities)
array([[  1.00000000e+00,   1.52204908e-03,   8.71545594e-02, ...,
          3.97681174e-04,   7.02593036e-04,   9.90472253e-04],
       [  1.52204908e-03,   1.00000000e+00,   3.96760121e-04, ...,
          4.04724413e-03,   3.65324300e-03,   5.63519735e-04],
       [  8.71545594e-02,   3.96760121e-04,   1.00000000e+00, ...,
          2.62367141e-04,   1.87878869e-03,   8.63876439e-06],
       ..., 
       [  3.97681174e-04,   4.04724413e-03,   2.62367141e-04, ...,
          1.00000000e+00,   8.05217639e-01,   2.69724702e-03],
       [  7.02593036e-04,   3.65324300e-03,   1.87878869e-03, ...,
          8.05217639e-01,   1.00000000e+00,   3.00229809e-03],
       [  9.90472253e-04,   5.63519735e-04,   8.63876439e-06, ...,
          2.69724702e-03,   3.00229809e-03,   1.00000000e+00]])
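For reference, the value in cell (a, b) of that matrix is the dot product of rows a and b divided by the product of their norms. The following is a minimal sketch in plain Python of that definition, not scikit-learn's implementation; the function name `cosine_similarity_rows` and the toy data are made up for illustration.

```python
import math


def cosine_similarity_rows(rows):
    """Return the full pairwise cosine-similarity matrix for a list of rows."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in rows]
    result = []
    for a, norm_a in zip(rows, norms):
        out = []
        for b, norm_b in zip(rows, norms):
            dot = sum(x * y for x, y in zip(a, b))
            # Treat the similarity with an all-zero row as 0.0 to avoid dividing by zero.
            out.append(dot / (norm_a * norm_b) if norm_a and norm_b else 0.0)
        result.append(out)
    return result


# Toy data (illustrative only): 3 rows with 3 features each.
toy = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0], [3.0, 0.0, 4.0]]
sims = cosine_similarity_rows(toy)
```

The result has one row and one column per input row, which is why 475 input rows give a (475, 475) matrix.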

Since I want to scale to much larger sets than my original (475-row) matrix, I am looking at using Spark via pySpark:

from pyspark.mllib.linalg.distributed import RowMatrix

#load data into spark 
tempSpark =  sc.parallelize(pdArray)
mat = RowMatrix(tempSpark)

# Calculate exact similarities
exact = mat.columnSimilarities()

exact.entries.first()
# MatrixEntry(128, 211, 0.004969676943490767)

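# Now when I get the data out I do the following...
# Note: the original snippet called approx.toRowMatrix(), but `approx` is never
# defined above; `exact` is assumed to be what was meant.
import numpy as np

# Convert to a RowMatrix and collect it into a dense NumPy array.
rowMat = exact.toRowMatrix()
t_3 = rowMat.rows.collect()
a_3 = np.array([x.toArray() for x in t_3])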
a_3.shape
# (488, 749)

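As you can see, the shape of the data a) is no longer square (which it should be) and b) has dimensions that do not match the original number of rows. It does (partially) match the number of features in each row (len(pdArray[0]) = 749), but I don't know where 488 is coming from.

The presence of 749 makes me think I need to transpose my data first. Is that correct?

Finally, if that is the case, why are the dimensions not (749, 749)?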

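First, the columnSimilarities method returns only the off-diagonal entries of the upper triangular portion of the similarity matrix. Without the 1's along the diagonal, entire rows of the resulting similarity matrix may contain nothing but 0's.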
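To make that concrete, here is a small plain-Python simulation of the behavior described above. It is not Spark's code: the function name and toy data are invented, and it only assumes that entries are kept for column pairs (i, j) with i < j and a nonzero similarity.

```python
import math


def upper_triangular_column_similarities(rows):
    """Cosine similarity between columns, keeping only nonzero entries with i < j."""
    n_cols = len(rows[0])
    cols = [[row[c] for row in rows] for c in range(n_cols)]
    norms = [math.sqrt(sum(x * x for x in col)) for col in cols]
    entries = []
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            dot = sum(x * y for x, y in zip(cols[i], cols[j]))
            if dot != 0:
                entries.append((i, j, dot / (norms[i] * norms[j])))
    return entries


# Toy data (illustrative only): 2 rows, 4 columns.
# Column 2 is orthogonal to the only later column (3), and column 3 is the
# last column, so neither of them can ever appear as a row index i.
toy = [
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 3.0],
]
entries = upper_triangular_column_similarities(toy)
non_empty_rows = sorted({i for i, _, _ in entries})
```

Here only row indices 0 and 1 carry any entries even though the similarity matrix is 4 x 4. The same effect, on a larger scale, would be consistent with seeing 488 non-empty rows for 749 columns.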

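Second, a pyspark RowMatrix doesn't have meaningful row indices. So when converting from a CoordinateMatrix to a RowMatrix, the i value in each MatrixEntry is mapped to whatever is convenient (probably some incrementing index). What is likely happening is that the rows consisting entirely of 0's are simply ignored, and the matrix is squished vertically when you convert it to a RowMatrix.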
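One way to sidestep the lost row indices is to skip the RowMatrix conversion and rebuild the square matrix directly from the (i, j, value) entries. The sketch below is plain Python with a hypothetical helper name, `entries_to_dense`; it assumes every vector is nonzero, so the diagonal can be filled with 1.0.

```python
def entries_to_dense(entries, n):
    """Build a full n x n similarity matrix from upper-triangular (i, j, value) entries."""
    dense = [[0.0] * n for _ in range(n)]
    for k in range(n):
        # Assumption: no all-zero vectors, so each vector's self-similarity is 1.
        dense[k][k] = 1.0
    for i, j, value in entries:
        dense[i][j] = value
        dense[j][i] = value  # cosine similarity is symmetric
    return dense


# Toy entries (illustrative only) for a 4 x 4 similarity matrix.
full = entries_to_dense([(0, 1, 0.5), (1, 3, 0.25)], 4)
```

With Spark, the entries could be gathered with `[(e.i, e.j, e.value) for e in exact.entries.collect()]` and `n` taken from `exact.numCols()`. Collecting to the driver is only practical while the similarity matrix fits in memory.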

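It probably makes sense to inspect the dimensions of the similarity matrix immediately after computing it with the columnSimilarities method. You can do this with the numRows() and numCols() methods: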

print(exact.numRows(),exact.numCols())

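Other than that, it does sound like you need to transpose your matrix to get the correct vector similarities. Furthermore, if for some reason you need this in a RowMatrix-like form, you could try using an IndexedRowMatrix, which does have meaningful row indices and would preserve the row index from the original CoordinateMatrix upon conversion.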
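Putting the two suggestions together might look like the sketch below. It is untested against a live cluster and rests on assumptions: `sc` is an existing SparkContext, `array` is a 2-D NumPy array such as pdArray, and the names `row_similarities` and `place_rows` are invented for illustration. The pyspark import is deferred so that the pure helper can be used without Spark installed.

```python
def row_similarities(sc, array):
    """Sketch: upper-triangular cosine similarities between the rows of `array`.

    Assumes `sc` is an existing SparkContext and `array` is a 2-D NumPy array.
    Returns a list of (row_index, values) pairs collected to the driver.
    """
    # Imported lazily so place_rows() below works without Spark installed.
    from pyspark.mllib.linalg.distributed import RowMatrix

    # columnSimilarities() compares columns, so feed it the transpose:
    # each original row becomes a column.
    mat = RowMatrix(sc.parallelize(array.T))
    sims = mat.columnSimilarities()      # CoordinateMatrix, upper triangle only
    indexed = sims.toIndexedRowMatrix()  # IndexedRow objects keep their row index
    return [(r.index, r.vector.toArray().tolist()) for r in indexed.rows.collect()]


def place_rows(indexed_rows, n):
    """Put (index, values) pairs at their own row index; missing rows stay all zero."""
    dense = [[0.0] * n for _ in range(n)]
    for index, values in indexed_rows:
        dense[index] = list(values)
    return dense


# Toy input (illustrative only): row 2 is absent, as the last row always is.
square = place_rows([(0, [0.0, 0.5, 0.0]), (1, [0.0, 0.0, 0.2])], 3)
```

Because each row is placed by its own index, rows with no entries stay in position as zeros and the result stays square. It still holds only the upper triangle, so the diagonal and lower triangle would need to be filled in separately.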

How many rows do the sparse vectors show for this: rowMat.rows.collect()?