Is there a more efficient way to implement cosine similarity in PySpark 1.6?


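I am trying to compute the cosine similarity between a given user_id from the users table and another table containing movies, so that I can rank the most similar movies for recommendation.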

Cosine similarity :=
dot(a,b)/(norm(a)*norm(b))
or, equivalently,
dot(a,b)/sqrt(dot(a,a)*dot(b,b))
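To make the equivalence of the two forms concrete, here is a minimal pure-Python sketch. The helper names (dot, cosine_via_norms, cosine_from_dots) and the vectors a and b are hypothetical, chosen only for illustration. The last call plugs in the dotxy, dotxx and dotyy values from one row of the output shown in the question.

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length sequences.
    return sum(x * y for x, y in zip(a, b))

def cosine_via_norms(a, b):
    # dot(a,b) / (norm(a) * norm(b)), with norm(v) = sqrt(dot(v,v)).
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def cosine_from_dots(dotxy, dotxx, dotyy):
    # dot(a,b) / sqrt(dot(a,a) * dot(b,b)), from precomputed dot products.
    return dotxy / math.sqrt(dotxx * dotyy)

# Hypothetical feature vectors, for illustration only.
a = [1.0, 0.0, 1.0]
b = [1.0, 1.0, 1.0]

print(cosine_via_norms(a, b))
print(cosine_from_dots(dot(a, b), dot(a, a), dot(b, b)))

# One row of the question's output: dotxy=1.0, dotxx=0.5, dotyy=2.0.
print(cosine_from_dots(1.0, 0.5, 2.0))  # 1.0
```

Both forms agree up to floating-point rounding. The second form is convenient when the three dot products are already available as columns.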

The resulting output looks like this:

+------+-------+-----+-----+-----+----------+
|userId|movieId|dotxy|dotxx|dotyy|cosine_sim|
+------+-------+-----+-----+-----+----------+
|    18|   1430|  1.0|  0.5|  2.0|       1.0|
|    18|   2177|  1.0|  0.5|  2.0|       1.0|
|    18|   1565|  1.0|  0.5|  2.0|       1.0|
|    18|    415|  1.0|  0.5|  2.0|       1.0|
|    18|   1764|  1.0|  0.5|  2.0|       1.0|
+------+-------+-----+-----+-----+----------+

Is there a more efficient/compact way to implement the cosine similarity function in PySpark 1.6?

You can make more use of numpy functions:

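import numpy as np

# `spark` is a SparkSession (Spark 2.0+); on PySpark 1.6 call sqlContext.createDataFrame instead.
df = spark.createDataFrame([(18, 1, [1, 0, 1], [1, 1, 1])]).toDF('userId','movieId','user_features','movie_features')

# Append the cosine similarity of the two feature arrays to each row.
def with_cosine(x):
    u, m = np.array(x[2]), np.array(x[3])
    return (x[0], x[1], x[2], x[3], float(np.dot(u, m) / (np.linalg.norm(u) * np.linalg.norm(m))))

df.rdd.map(with_cosine).toDF(df.columns + ['cosine_sim']).show()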

+------+-------+-------------+--------------+------------------+
|userId|movieId|user_features|movie_features|        cosine_sim|
+------+-------+-------------+--------------+------------------+
|    18|      1|    [1, 0, 1]|     [1, 1, 1]|0.8164965809277259|
+------+-------+-------------+--------------+------------------+
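The expression above divides by zero when either feature vector is all zeros. Below is a minimal sketch of the same numpy computation pulled out into a named helper that guards against that case. The names cosine_sim and add_cosine are hypothetical, and returning 0.0 for a zero vector is an assumed convention, not something the original answer specifies.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity of two numeric sequences, using numpy.
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        # Assumed convention: treat a zero vector as having similarity 0.0.
        return 0.0
    return float(np.dot(a, b) / denom)

def add_cosine(row):
    # Expects (userId, movieId, user_features, movie_features) and
    # returns the same fields with the similarity appended.
    return tuple(row) + (cosine_sim(row[2], row[3]),)

# Local check on the sample row from the answer, no Spark needed.
print(add_cosine((18, 1, [1, 0, 1], [1, 1, 1])))
```

With Spark available, the helper would slot into the same pipeline as df.rdd.map(add_cosine).toDF(df.columns + ['cosine_sim']). Note that numpy then has to be installed on every worker node, since the function runs on the executors.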