How to measure a similarity score between a row and a matrix/table in pyspark?

I have a table of user preferences:

+-------+------+-------+-------+--
|user_id|Action| Comedy|Fantasy|
+-------+------+-------+-------+--
|   100 |  0   | 0.33..| 0.66..|
|   101 |0.42..| 0.15..| 0.57..|
+-------+------+-------+-------+--
And a table of movie genres and content:

+-------+------+-------+-------+--
|movieId|Action| Comedy|Fantasy|
+-------+------+-------+-------+--
|  1001 |  1   |   1   |   0   |
|  1011 |  0   |   1   |   1   |
+-------+------+-------+-------+--
How can I get the dot product (as a similarity distance) between a user's preference row (selected by its user_id) and each movie content row, so that I can output the most preferred movieId based on movie genres? Either an RDD or a DataFrame solution is fine.

Here is my attempt:

Cross-join the two dataframes so that each user_id is merged with each movieId; this creates a dataframe of size (number of user_ids) * (number of movieIds).

Then you can multiply the arrays element by element using zip_with with a specific function, in this case x * y for each element x of array1 and element y of array2.

Finally, you can aggregate the element-wise products, i.e. sum them: starting from sum = 0, each element x of zipArray is added to the temporary variable sum, which is exactly the usual sum function.
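
As a quick, self-contained illustration of those two SQL higher-order functions (a minimal sketch on hard-coded arrays; the demo dataframe and the column names a, b, prod and dot are made up here, the existing spark session is assumed, and both functions require Spark 2.4+):

from pyspark.sql.functions import expr

# One user preference vector and one movie genre vector as literal arrays.
demo = spark.createDataFrame([([0.0, 0.33, 0.66], [0, 1, 1])], ["a", "b"])

demo.withColumn("prod", expr("zip_with(a, b, (x, y) -> x * y)")) \
    .withColumn("dot",  expr("aggregate(prod, 0D, (sum, x) -> sum + x)")) \
    .show(truncate=False)
# prod = [0.0, 0.33, 0.66], dot ≈ 0.99  (0*0.0 + 1*0.33 + 1*0.66)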

from pyspark.sql.functions import array, arrays_zip, expr, rank, desc

# Read the user preference table and the movie genre table.
df1 = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
df2 = spark.read.option("header","true").option("inferSchema","true").csv("test2.csv")

# Keep only the genre columns by dropping the id column from each column list.
df1_cols = df1.columns
df1_cols.remove('user_id')
df2_cols = df2.columns
df2_cols.remove('movieId')

# Pack the genre columns of each table into a single array column.
df1 = df1.withColumn('array1', array(df1_cols))
df2 = df2.withColumn('array2', array(df2_cols))

# Pair every user row with every movie row.
df3 = df1.crossJoin(df2)
df3.show(10, False)

+-------+------+------+-------+------------------+-------+------+------+-------+---------+
|user_id|Action|Comedy|Fantasy|array1            |movieId|Action|Comedy|Fantasy|array2   |
+-------+------+------+-------+------------------+-------+------+------+-------+---------+
|100    |0.0   |0.33  |0.66   |[0.0, 0.33, 0.66] |1001   |1     |1     |0      |[1, 1, 0]|
|100    |0.0   |0.33  |0.66   |[0.0, 0.33, 0.66] |1011   |0     |1     |1      |[0, 1, 1]|
|101    |0.42  |0.15  |0.57   |[0.42, 0.15, 0.57]|1001   |1     |1     |0      |[1, 1, 0]|
|101    |0.42  |0.15  |0.57   |[0.42, 0.15, 0.57]|1011   |0     |1     |1      |[0, 1, 1]|
+-------+------+------+-------+------------------+-------+------+------+-------+---------+


# Element-wise product of the two arrays, then sum the products to get the dot product.
df3 = df3.withColumn('zipArray',   expr("zip_with(array1, array2, (x, y) -> x * y)")) \
         .withColumn('dotProduct', expr("aggregate(zipArray, 0D, (sum, x) -> sum + x)"))
                     
df3.show(10, False)

+-------+------+------+-------+------------------+-------+------+------+-------+---------+-----------------+----------+
|user_id|Action|Comedy|Fantasy|array1            |movieId|Action|Comedy|Fantasy|array2   |zipArray         |dotProduct|
+-------+------+------+-------+------------------+-------+------+------+-------+---------+-----------------+----------+
|100    |0.0   |0.33  |0.66   |[0.0, 0.33, 0.66] |1001   |1     |1     |0      |[1, 1, 0]|[0.0, 0.33, 0.0] |0.33      |
|100    |0.0   |0.33  |0.66   |[0.0, 0.33, 0.66] |1011   |0     |1     |1      |[0, 1, 1]|[0.0, 0.33, 0.66]|0.99      |
|101    |0.42  |0.15  |0.57   |[0.42, 0.15, 0.57]|1001   |1     |1     |0      |[1, 1, 0]|[0.42, 0.15, 0.0]|0.57      |
|101    |0.42  |0.15  |0.57   |[0.42, 0.15, 0.57]|1011   |0     |1     |1      |[0, 1, 1]|[0.0, 0.15, 0.57]|0.72      |
+-------+------+------+-------+------------------+-------+------+------+-------+---------+-----------------+----------+


from pyspark.sql import Window

# Rank movies per user by descending dot product; rank 1 is the best match for that user.
window = Window.partitionBy('user_id').orderBy(desc('dotProduct'))

df3.select('user_id', 'movieId', 'dotProduct') \
   .withColumn('rank', rank().over(window)) \
   .filter('rank = 1') \
   .drop('rank') \
   .show(10, False)

+-------+-------+----------+
|user_id|movieId|dotProduct|
+-------+-------+----------+
|101    |1011   |0.72      |
|100    |1011   |0.99      |
+-------+-------+----------+
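
Note that rank keeps ties, so a user could get several movieIds if two dot products are equal; if exactly one row per user is wanted, row_number can be swapped in (a small variation on the code above, not part of the original answer):

from pyspark.sql.functions import row_number

df3.select('user_id', 'movieId', 'dotProduct') \
   .withColumn('rn', row_number().over(window)) \
   .filter('rn = 1') \
   .drop('rn') \
   .show(10, False)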

Thank you sir, it works great! Although I have a hard time understanding the line df3 = df1.crossJoin(df2).withColumn('dotProduct', expr("aggregate(zip_with(array1, array2, (x, y) -> x * y), 0D, (sum, x) -> sum + x)")). Could you explain what exactly is happening there?

I would suggest you look up the function definitions; in any case, I have added some comments.

Thank you for your time and effort, it is clear and precise! Which part of the code should I change so that only a given row of df1 (by its id number) is taken and the dot product is computed against the rows of df2?

Load the csv into df1 and then filter on user_id as soon as possible.