Is there a more efficient way to implement cosine similarity in PySpark 1.6?
I am trying to compute the cosine similarity between a given user_id from the user table and another table containing movies, so that the most similar movies can be ranked for recommendation.
Cosine similarity: = dot(a,b)/(norm(a)*norm(b))
or dot(a,b)/sqrt(dot(a,a)*dot(b,b))
The resulting output looks like this:
+------+-------+-----+-----+-----+----------+
|userId|movieId|dotxy|dotxx|dotyy|cosine_sim|
+------+-------+-----+-----+-----+----------+
|    18|   1430|  1.0|  0.5|  2.0|       1.0|
|    18|   2177|  1.0|  0.5|  2.0|       1.0|
|    18|   1565|  1.0|  0.5|  2.0|       1.0|
|    18|    415|  1.0|  0.5|  2.0|       1.0|
|    18|   1764|  1.0|  0.5|  2.0|       1.0|
+------+-------+-----+-----+-----+----------+
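For reference, the two forms of the formula agree because norm(a) = sqrt(dot(a,a)). Below is a minimal pure-Python sketch of how the dotxy, dotxx and dotyy columns combine into cosine_sim. The helper names `dot` and `cosine_from_dots` and the sample vectors are illustrative assumptions, not taken from the original code.

```python
import math


def dot(a, b):
    """Dot product of two equal-length sequences of numbers."""
    return sum(x * y for x, y in zip(a, b))


def cosine_from_dots(dotxy, dotxx, dotyy):
    """Cosine similarity from the three dot products: dotxy / sqrt(dotxx * dotyy)."""
    return dotxy / math.sqrt(dotxx * dotyy)


# Illustrative vectors, not data from the question.
x = [1, 0, 1]
y = [1, 1, 1]

# Form 2: dot(a,b) / sqrt(dot(a,a) * dot(b,b))
via_dots = cosine_from_dots(dot(x, y), dot(x, x), dot(y, y))
# Form 1: dot(a,b) / (norm(a) * norm(b)), with norm(a) = sqrt(dot(a,a))
via_norms = dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))

print(via_dots, via_norms)
```

Plugging in the values from the table rows, dotxy = 1.0, dotxx = 0.5 and dotyy = 2.0 give 1.0 / sqrt(1.0) = 1.0, which matches the cosine_sim column.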
Is there a more efficient/compact way to implement the cosine similarity function in PySpark 1.6?

You can lean more on numpy functions:
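import numpy as np

# `spark` is the SparkSession of Spark 2.0+; on PySpark 1.6 call sqlContext.createDataFrame instead.
df = spark.createDataFrame([(18, 1, [1, 0, 1], [1, 1, 1])]).toDF('userId', 'movieId', 'user_features', 'movie_features')

# For each row, append dot(u, m) / (norm(u) * norm(m)) as a new cosine_sim column.
df.rdd.map(lambda x: (x[0], x[1], x[2], x[3], float(np.dot(np.array(x[2]), np.array(x[3])) / (np.linalg.norm(np.array(x[2])) * np.linalg.norm(np.array(x[3])))))).toDF(df.columns + ['cosine_sim']).show()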
+------+-------+-------------+--------------+------------------+
|userId|movieId|user_features|movie_features|        cosine_sim|
+------+-------+-------------+--------------+------------------+
|    18|      1|    [1, 0, 1]|     [1, 1, 1]|0.8164965809277259|
+------+-------+-------------+--------------+------------------+
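One thing the one-line lambda does not handle is an all-zero feature vector, where the denominator is 0. A possible variant, sketched here as a named function so it can be tried without a cluster, is shown below. The name `with_cosine`, the choice to return None for zero vectors, and the assumed row layout (userId, movieId, user_features, movie_features) are my own assumptions, not part of the original answer.

```python
import numpy as np


def with_cosine(row):
    """Return the row's fields plus a trailing cosine similarity.

    Assumes the layout (userId, movieId, user_features, movie_features).
    """
    u = np.asarray(row[2], dtype=float)
    m = np.asarray(row[3], dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(m)
    # A zero vector has no direction, so report None instead of dividing by zero.
    sim = float(np.dot(u, m) / denom) if denom > 0 else None
    return tuple(row) + (sim,)


print(with_cosine((18, 1, [1, 0, 1], [1, 1, 1])))
```

It should be usable in place of the lambda, as in df.rdd.map(with_cosine).toDF(df.columns + ['cosine_sim']). If None values can show up in the first rows, schema inference may need an explicit schema instead of a list of column names.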