Apache Spark: exploding a SparseVector column into rows of index and value


apache-spark pyspark


I have a SparseVector produced by an IDF transformation that looks like this:

user='1234', idf=SparseVector(174, {0: 0.4709, 5: 0.8967, 7: 0.9625, 8: 0.9814,...})

I would like to explode it into the following:

|index|rating|user|
|0    |0.4709|1234|
|5    |0.8967|1234|
|7    |0.9625|1234|
|8    |0.9814|1234|
.
.
.

My goal is to take these (index, value) tuples and run an ALS step on them.

This task requires a user-defined function (UserDefinedFunction):

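from pyspark.sql.functions import udf, explode
from pyspark.ml.linalg import SparseVector, DenseVector

# Assumes an active SparkSession bound to the name `spark`
df = spark.createDataFrame([
    ('1234', SparseVector(174, {0: 0.4709, 5: 0.8967, 7: 0.9625, 8: 0.9814}))
]).toDF("user", "idf")

# The UDF returns a map of index -> value so that explode can emit one row per entry
@udf("map<long, double>")
def vector_as_map(v):
    if isinstance(v, SparseVector):
        # Only the stored (non-zero) entries are returned
        return dict(zip(v.indices.tolist(), v.values.tolist()))
    elif isinstance(v, DenseVector):
        # Every position of a dense vector becomes an entry
        return dict(zip(range(len(v)), v.values.tolist()))

# Exploding a map column yields two columns, aliased here as index and rating
df.select("user", explode(vector_as_map("idf")).alias("index", "rating")).show()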

This gives you the expected result:

+----+-----+------+                                                             
|user|index|rating|
+----+-----+------+
|1234|    0|0.4709|
|1234|    8|0.9814|
|1234|    5|0.8967|
|1234|    7|0.9625|
+----+-----+------+
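
To make the logic of the UDF plus explode easier to follow, here is a minimal pure-Python sketch that runs without Spark. It is only an illustration under assumptions: a plain dict stands in for a SparseVector and a plain list for a DenseVector, and the helper explode_rows is a hypothetical name that does not exist in PySpark.

```python
from typing import Dict, Iterator, Sequence, Tuple, Union

# Stand-ins (assumptions, not PySpark types):
# a dict {index: value} plays the role of a SparseVector,
# a list of floats plays the role of a DenseVector.
Vector = Union[Dict[int, float], Sequence[float]]


def vector_as_map(v: Vector) -> Dict[int, float]:
    """Return an {index: value} mapping for either kind of stand-in vector."""
    if isinstance(v, dict):
        # Sparse case: only the stored entries are kept
        return dict(v)
    # Dense case: every position becomes an entry
    return dict(enumerate(v))


def explode_rows(user: str, v: Vector) -> Iterator[Tuple[str, int, float]]:
    """Hypothetical helper: emit one (user, index, rating) row per map entry."""
    # Sorting is only for readable output here
    for index, rating in sorted(vector_as_map(v).items()):
        yield (user, index, rating)


rows = list(explode_rows("1234", {0: 0.4709, 5: 0.8967, 7: 0.9625, 8: 0.9814}))
for row in rows:
    print(row)
```

Note that the sketch sorts the entries for readability, whereas the Spark output above lists the rows as 0, 8, 5, 7, so do not rely on the order of the exploded rows; add an explicit sort if the order matters.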


Yes, so what seemed to work for me was:

@udf(returnType=MapType(LongType(), DoubleType()))

Any idea why that works but the string doesn't?

No clue. I tested this with 2.3 and 2.4 and could not reproduce the problem. Besides, this was resolved in 2.2, so it should work in every version where both the decorator and the type work. It sounds a bit like there is no SparkContext in scope.