Apache Spark: row/column names for the values of a correlation matrix in Spark [apache-spark, apache-spark-sql, spark-dataframe, apache-spark-mllib]

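I computed a correlation matrix in Spark, and I would like to extract the individual correlations together with their column names.

Correlation matrix: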

correlMatrix: org.apache.spark.mllib.linalg.Matrix = 
1.0                   -0.33333333333333254  -0.8164965809277261  -0.7777777777777787   
-0.33333333333333254  1.0                   0.8164965809277356   -0.33333333333333254  
-0.8164965809277261   0.8164965809277356    1.0                  0.27216552697591645   
-0.7777777777777787   -0.33333333333333254  0.27216552697591645  1.0

Column names:

colNames: Array[String] = Array(item_1, item_2, item_3, item_4)

Now I would like to extract each combination into a DataFrame with the following structure:

item_from | item_to | Correlation
item_1    | item_2  | -0.0096912
item_1    | item_3  | -0.7313071
item_2    | item_3  | 0.68910356

Or at least the whole correlation matrix with column names:

           item_1                item_2                item_3          item_4
item_1     1.0                   -0.33333333333333254  -0.8164965809277261  -0.7777777777777787   
item_2     -0.33333333333333254  1.0                   0.8164965809277356   -0.33333333333333254  
item_3     -0.8164965809277261   0.8164965809277356    1.0                  0.27216552697591645   
item_4     -0.7777777777777787   -0.33333333333333254  0.27216552697591645  1.0

I tried writing a map function, but it did not work as I expected.

Is there an approach you could suggest?

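// .toDF needs the session implicits; this assumes a SparkSession named `spark`
// (on Spark 1.x, import sqlContext.implicits._ instead)
import spark.implicits._

// All ordered pairs of column names; the first name varies slowest
val colNamePairs = colNames.flatMap(c1 => colNames.map(c2 => (c1, c2)))

// Pair each name combination with its matrix entry, then keep one half of the matrix
val triplesList = colNamePairs.zip(correlMatrix.toArray)
  .filterNot(p => p._1._1 >= p._1._2)
  .map(r => (r._1._1, r._1._2, r._2))

val corrDF = sc.parallelize(triplesList).toDF("item_from", "item_to", "Correlation")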

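colNamePairs generates every combination of column names. triplesList is the list of (colName1, colName2, correlation) triples. Zipping the pairs with correlMatrix.toArray lines up correctly because toArray lists the matrix column by column and a correlation matrix is symmetric, so reading it by rows or by columns gives the same values.

Finally, we convert it to a DataFrame with the desired column names.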
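To make the alignment between name pairs and matrix values concrete, here is a small plain-Scala illustration that needs no Spark. The object name PairAlignment, the three short column names, and the 3x3 values are made up for the example; the values array stands in for what correlMatrix.toArray would return for a dense matrix.

```scala
// Plain-Scala illustration; no Spark required.
object PairAlignment {
  // Hypothetical column names
  val names: Array[String] = Array("a", "b", "c")

  // Hypothetical 3x3 symmetric matrix, flattened column by column
  // (the layout Matrix.toArray uses)
  val values: Array[Double] = Array(
    1.0, 0.1, 0.2, // column a
    0.1, 1.0, 0.3, // column b
    0.2, 0.3, 1.0  // column c
  )

  // Same construction as colNamePairs: the first name varies slowest
  val pairs: Array[(String, String)] =
    names.flatMap(c1 => names.map(c2 => (c1, c2)))

  // Position k of `pairs` is matched with position k of `values`
  val labelled: Array[((String, String), Double)] = pairs.zip(values)
}
```

Printing PairAlignment.labelled shows each name pair next to its value, which is an easy way to check the pairing on a small matrix before running it on real data.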

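Note that filterNot is optional. It keeps only one half of the matrix (excluding the diagonal), because the matrix is symmetric and the other half is redundant. If you want the full list, just remove it.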
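The filter compares the names as strings, so which name lands in item_from follows lexicographic order (item_10 sorts before item_2, for instance), and two columns that share a name would be dropped entirely. If that matters, one possible alternative is to select the upper triangle by index instead. The following is a minimal sketch in plain Scala with no Spark dependency; UpperTriangle and upperTriples are hypothetical names, and values is assumed to be the column-major array returned by correlMatrix.toArray.

```scala
object UpperTriangle {
  /** Returns (rowName, colName, value) for every cell strictly above the diagonal.
    * `values` is assumed to hold an n x n matrix flattened in column-major order.
    */
  def upperTriples(names: Array[String], values: Array[Double]): Seq[(String, String, Double)] = {
    val n = names.length
    require(values.length == n * n, "values must hold an n x n matrix")
    for {
      i <- 0 until n       // row index
      j <- (i + 1) until n // column index, strictly right of the diagonal
    } yield (names(i), names(j), values(j * n + i)) // column-major: cell (i, j) is at j * n + i
  }
}
```

Under the same assumptions, the result of UpperTriangle.upperTriples(colNames, correlMatrix.toArray) could be passed to sc.parallelize(...).toDF("item_from", "item_to", "Correlation") in place of triplesList, with item_from always being the column that comes first in colNames.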