How do I measure the similarity score between a row and a matrix/table in PySpark?
I have a table of user preferences:
+-------+------+-------+-------+
|user_id|Action| Comedy|Fantasy|
+-------+------+-------+-------+
|  100  |  0   | 0.33..| 0.66..|
|  101  |0.42..| 0.15..| 0.57..|
+-------+------+-------+-------+
And a table of movie genres/content:
+-------+------+-------+-------+
|movieId|Action| Comedy|Fantasy|
+-------+------+-------+-------+
|  1001 |   1  |   1   |   0   |
|  1011 |   0  |   1   |   1   |
+-------+------+-------+-------+
How can I take a user's preference row (selected by its user_id), compute the dot product (similarity score) with every movie content row, and output the top-priority movieId for that user based on movie genres? The result can be in RDD or DataFrame format. Here is my attempt.
Cross join the two dataframes so that every user_id is paired with every movieId; the resulting dataframe has (number of user_ids) * (number of movieIds) rows.

Then you can multiply the two arrays element by element using zip_with with a specific function, in this case x * y for each element x of array1 and the corresponding element y of array2.

Finally, you can aggregate the element-wise products, i.e. sum them: starting from sum = 0, each element x of zipArray is added to the temporary variable sum, which is exactly the usual sum function.
from pyspark.sql.functions import array, desc, expr, rank

# Load the user-preference and movie-genre tables.
df1 = spark.read.option("header","true").option("inferSchema","true").csv("test1.csv")
df2 = spark.read.option("header","true").option("inferSchema","true").csv("test2.csv")

# Collect the genre columns (everything except the id column) into one array column.
df1_cols = df1.columns
df1_cols.remove('user_id')
df2_cols = df2.columns
df2_cols.remove('movieId')
df1 = df1.withColumn('array1', array(df1_cols))
df2 = df2.withColumn('array2', array(df2_cols))

# Pair every user with every movie.
df3 = df1.crossJoin(df2)
df3.show(10, False)
+-------+------+------+-------+------------------+-------+------+------+-------+---------+
|user_id|Action|Comedy|Fantasy|array1 |movieId|Action|Comedy|Fantasy|array2 |
+-------+------+------+-------+------------------+-------+------+------+-------+---------+
|100 |0.0 |0.33 |0.66 |[0.0, 0.33, 0.66] |1001 |1 |1 |0 |[1, 1, 0]|
|100 |0.0 |0.33 |0.66 |[0.0, 0.33, 0.66] |1011 |0 |1 |1 |[0, 1, 1]|
|101 |0.42 |0.15 |0.57 |[0.42, 0.15, 0.57]|1001 |1 |1 |0 |[1, 1, 0]|
|101 |0.42 |0.15 |0.57 |[0.42, 0.15, 0.57]|1011 |0 |1 |1 |[0, 1, 1]|
+-------+------+------+-------+------------------+-------+------+------+-------+---------+
# Element-wise product of the two genre vectors, then fold (sum) the products
# into the dot product.
df3 = df3.withColumn('zipArray', expr("zip_with(array1, array2, (x, y) -> x * y)")) \
         .withColumn('dotProduct', expr("aggregate(zipArray, 0D, (sum, x) -> sum + x)"))
df3.show(10, False)
+-------+------+------+-------+------------------+-------+------+------+-------+---------+-----------------+----------+
|user_id|Action|Comedy|Fantasy|array1 |movieId|Action|Comedy|Fantasy|array2 |zipArray |dotProduct|
+-------+------+------+-------+------------------+-------+------+------+-------+---------+-----------------+----------+
|100 |0.0 |0.33 |0.66 |[0.0, 0.33, 0.66] |1001 |1 |1 |0 |[1, 1, 0]|[0.0, 0.33, 0.0] |0.33 |
|100 |0.0 |0.33 |0.66 |[0.0, 0.33, 0.66] |1011 |0 |1 |1 |[0, 1, 1]|[0.0, 0.33, 0.66]|0.99 |
|101 |0.42 |0.15 |0.57 |[0.42, 0.15, 0.57]|1001 |1 |1 |0 |[1, 1, 0]|[0.42, 0.15, 0.0]|0.57 |
|101 |0.42 |0.15 |0.57 |[0.42, 0.15, 0.57]|1011 |0 |1 |1 |[0, 1, 1]|[0.0, 0.15, 0.57]|0.72 |
+-------+------+------+-------+------------------+-------+------+------+-------+---------+-----------------+----------+
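As an aside, if you prefer the Python API over SQL expression strings, the same dot product can be written with the higher-order-function helpers in pyspark.sql.functions. This is only a minimal sketch assuming Spark 3.1+ (where zip_with and aggregate are available as Python functions); df3_alt is an illustrative name:

from pyspark.sql.functions import aggregate, lit, zip_with

# Same computation as the expr() version above: multiply the two arrays
# element-wise, then fold the products into a sum starting from 0.0.
df3_alt = df1.crossJoin(df2) \
    .withColumn('zipArray', zip_with('array1', 'array2', lambda x, y: x * y)) \
    .withColumn('dotProduct', aggregate('zipArray', lit(0.0), lambda acc, x: acc + x))
df3_alt.select('user_id', 'movieId', 'dotProduct').show(10, False)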
from pyspark.sql import Window

# Rank the movies per user by descending dot product and keep the best match.
window = Window.partitionBy('user_id').orderBy(desc('dotProduct'))
df3.select('user_id', 'movieId', 'dotProduct') \
    .withColumn('rank', rank().over(window)) \
    .filter('rank = 1') \
    .drop('rank') \
    .show(10, False)
+-------+-------+----------+
|user_id|movieId|dotProduct|
+-------+-------+----------+
|101 |1011 |0.72 |
|100 |1011 |0.99 |
+-------+-------+----------+
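One design note: rank() keeps every movie that ties for the best score, so a user could get more than one row. If you want exactly one movieId per user, a variant using row_number() (the choice among tied movies is then arbitrary) could look like this:

from pyspark.sql import Window
from pyspark.sql.functions import desc, row_number

# row_number() assigns 1 to a single movie per user even when several movies
# share the highest dotProduct.
window = Window.partitionBy('user_id').orderBy(desc('dotProduct'))
df3.select('user_id', 'movieId', 'dotProduct') \
    .withColumn('rn', row_number().over(window)) \
    .filter('rn = 1') \
    .drop('rn') \
    .show(10, False)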
Thank you sir, it works great! Although I'm having a hard time understanding the line df3 = df1.crossJoin(df2).withColumn('dotProduct', expr("aggregate(zip_with(array1, array2, (x, y) -> x * y), 0D, (sum, x) -> sum + x)")). Could you explain what exactly happens there?

I suggest you look up the function definitions; in any case, I have added some comments.

Thanks for your time and effort, it is both clear and accurate! Which part of the code should I change so that only a given row (selected by its id) is taken from df1 and the dot product is computed against the rows of df2?

Load the csv into df1, then filter on the user_id as early as possible.
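For example, a minimal sketch of that early filter, assuming the dataframes built above (the id 100 and the names target_user, df1_single, df3_single are only illustrative):

from pyspark.sql.functions import col, expr

# Keep only the requested user's row before the cross join, so the dot product
# is computed against df2 for that single user only.
target_user = 100  # illustrative id
df1_single = df1.filter(col('user_id') == target_user)
df3_single = df1_single.crossJoin(df2) \
    .withColumn('dotProduct',
                expr("aggregate(zip_with(array1, array2, (x, y) -> x * y), 0D, (sum, x) -> sum + x)"))
df3_single.select('user_id', 'movieId', 'dotProduct').show(10, False)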