Converting org.apache.spark.mllib.linalg.Matrix to a Spark DataFrame in Scala
I have an input DataFrame input_df as:
+---------------+--------------------+
|Main_CustomerID| Vector|
+---------------+--------------------+
| 725153|[3.0,2.0,6.0,0.0,9.0|
| 873008|[4.0,1.0,0.0,1.0,...|
| 625109|[1.0,0.0,6.0,1.0,...|
| 817171|[0.0,4.0,0.0,7.0,...|
| 611498|[1.0,0.0,4.0,5.0,...|
+---------------+--------------------+
input_df has the following schema:
root
|-- Main_CustomerID: integer (nullable = true)
|-- Vector: vector (nullable = true)
Following a reference, I created an indexed row matrix and then ran the following:
val lm = irm.toIndexedRowMatrix.toBlockMatrix.toLocalMatrix
to find the cosine similarity between columns. I now have a resulting mllib matrix:
cosineSimilarity: org.apache.spark.mllib.linalg.Matrix =
0.0 0.4199605255658081 0.5744269579035528 0.22075539284417395 0.561434614044346
0.0 0.0 0.2791452631195413 0.7259079527665503 0.6206918387272496
0.0 0.0 0.0 0.31792539222893695 0.6997167152675132
0.0 0.0 0.0 0.0 0.6776404124278828
0.0 0.0 0.0 0.0 0.0
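For readers who want to reproduce a matrix of this shape: the question does not show how irm was built, so the following is only a sketch under the assumption that it came from RowMatrix.columnSimilarities(), which returns an upper-triangular CoordinateMatrix (consistent with the zeros on and below the diagonal above). The object name, the helper columnCosine, and the sample vectors are hypothetical and are not taken from the question.

```scala
import org.apache.spark.mllib.linalg.{Matrix, Vector, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object CosineSimilaritySketch {

  // Hypothetical helper: column-wise cosine similarities collected as a local Matrix.
  def columnCosine(rows: RDD[Vector]): Matrix = {
    // columnSimilarities() only stores entries (i, j) with i < j.
    val sims = new RowMatrix(rows).columnSimilarities()
    // Same conversion chain as in the question; collects the result to the driver.
    sims.toIndexedRowMatrix().toBlockMatrix().toLocalMatrix()
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .appName("cosine-similarity-sketch")
      .master("local[*]")
      .getOrCreate()

    // Made-up stand-in for the Vector column of input_df (assumed to hold mllib vectors).
    val rows = spark.sparkContext.parallelize(Seq(
      Vectors.dense(3.0, 2.0, 6.0),
      Vectors.dense(4.0, 1.0, 0.0),
      Vectors.dense(1.0, 0.0, 6.0)
    ))

    val lm = columnCosine(rows)
    println(lm)

    spark.stop()
  }
}
```

Note that toLocalMatrix() brings the whole matrix to the driver, so this only suits matrices small enough to fit in driver memory.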
Now I need to convert this lm, which is of type org.apache.spark.mllib.linalg.Matrix, into a DataFrame. I want my output DataFrame to look like this:
+---+------------------+------------------+-------------------+------------------+
| _1| _2| _3| _4| _5|
+---+------------------+------------------+-------------------+------------------+
|0.0|0.4199605255658081|0.5744269579035528|0.22075539284417395| 0.561434614044346|
|0.0| 0.0|0.2791452631195413| 0.7259079527665503|0.6206918387272496|
|0.0| 0.0| 0.0|0.31792539222893695|0.6997167152675132|
|0.0| 0.0| 0.0| 0.0|0.6776404124278828|
|0.0| 0.0| 0.0| 0.0| 0.0|
+---+------------------+------------------+-------------------+------------------+
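How can I do this in Scala?

To convert the Matrix to the DataFrame shown above, do the following. The matrix is first converted to a DataFrame with a single column holding an array (one array per matrix row). foldLeft is then used to split the array into separate columns: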
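import spark.implicits._

// Column indices 0 until numCols, used to name the output columns _1, _2, ...
val cols = (0 until lm.numCols).toSeq

// colIter on the transpose yields the rows of lm; each becomes one Array[Double] record
val df = lm.transpose
  .colIter.toSeq
  .map(_.toArray)
  .toDF("arr")

// Add one column per array element, then drop the temporary array column
val df2 = cols
  .foldLeft(df)((acc, i) => acc.withColumn("_" + (i + 1), $"arr"(i)))
  .drop("arr")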
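As an alternative to the foldLeft approach, the same layout can probably be reached by building Row objects from rowIter and passing an explicit schema to createDataFrame. This is a minimal sketch rather than a definitive implementation: the object name, the helper matrixToDf, and the small 2x3 sample matrix are hypothetical stand-ins, and the session is local for illustration only.

```scala
import org.apache.spark.mllib.linalg.{Matrices, Matrix}
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object MatrixToDataFrame {

  // Hypothetical helper: one DataFrame row per matrix row, columns named _1, _2, ...
  def matrixToDf(spark: SparkSession, m: Matrix): DataFrame = {
    // One DoubleType field per matrix column.
    val schema = StructType(
      (1 to m.numCols).map(i => StructField(s"_$i", DoubleType, nullable = false))
    )
    // rowIter yields each matrix row as a Vector; wrap its values in a Row.
    val rows = m.rowIter.map(v => Row.fromSeq(v.toArray.toSeq)).toList
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("matrix-to-dataframe")
      .master("local[*]")
      .getOrCreate()

    // Made-up stand-in for lm; Matrices.dense takes values in column-major order.
    val lm: Matrix = Matrices.dense(2, 3, Array(0.0, 0.0, 0.5, 0.0, 0.25, 0.75))

    matrixToDf(spark, lm).show()

    spark.stop()
  }
}
```

Compared with the foldLeft version, this avoids the temporary array column and makes the column types explicit, at the cost of a little more setup code.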