Scala: Spark DataFrame to nested map


How can I convert a fairly small DataFrame in Spark (300 MB at most) into a nested map, in order to improve Spark's DAG? I believe this operation will be faster than a later join, because the transformed values are created in the training step of a custom estimator; now I just want to apply them quickly in the pipeline's prediction step.

// requires: import spark.implicits._ (for toDF)
val inputSmall = Seq(
    ("A", 0.3, "B", 0.25),
    ("A", 0.3, "g", 0.4),
    ("d", 0.0, "f", 0.1),
    ("d", 0.0, "d", 0.7),
    ("A", 0.3, "d", 0.7),
    ("d", 0.0, "g", 0.4),
    ("c", 0.2, "B", 0.25)).toDF("column1", "transformedCol1", "column2", "transformedCol2")
This gives a map of the wrong type:

val inputToMap = inputSmall.collect.map(r => Map(inputSmall.columns.zip(r.toSeq):_*))
I would rather have something like this:

Map[String, Map[String, Double]]("column1" -> Map("A" -> 0.3, "d" -> 0.0, ...), "column2" -> Map("B" -> 0.25, "g" -> 0.4, ...))
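For context, the point of this shape is constant-time lookups in the prediction step. A minimal pure-Scala sketch of the intended usage (the map values are hard-coded from the example rows above; `lookup` and its default are illustrative, not part of the question):

```scala
// Nested map as it would look after training, values taken from the example data
val transformed: Map[String, Map[String, Double]] = Map(
  "column1" -> Map("A" -> 0.3, "d" -> 0.0, "c" -> 0.2),
  "column2" -> Map("B" -> 0.25, "g" -> 0.4, "f" -> 0.1, "d" -> 0.7)
)

// Fast lookup during prediction; fall back to a default for unseen keys
def lookup(col: String, key: String, default: Double = 0.0): Double =
  transformed.getOrElse(col, Map.empty).getOrElse(key, default)

println(lookup("column1", "A"))   // 0.3
println(lookup("column2", "zzz")) // 0.0 (unseen key)
```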

I'm not sure I follow the motivation, but I think this transformation would get you the result you're after:

import org.apache.spark.sql.Row

// collect from DF (by your assumption - it is small enough)
val data: Array[Row] = inputSmall.collect()

// Create the "column pairs" -
// can be replaced with hard-coded value: List(("column1", "transformedCol1"), ("column2", "transformedCol2"))
val columnPairs: List[(String, String)] = inputSmall.columns
  .grouped(2)
  .collect { case Array(k, v) => (k, v) }
  .toList

// for each pair, get data and group it by left-column's value, choosing first match
val result: Map[String, Map[String, Double]] = columnPairs
  .map { case (k, v) => k -> data.map(r => (r.getAs[String](k), r.getAs[Double](v))) }
  .toMap
  .mapValues(l => l.groupBy(_._1).map { case (c, l2) => l2.head })

result.foreach(println)
// prints: 
// (column1,Map(A -> 0.3, d -> 0.0, c -> 0.2))
// (column2,Map(d -> 0.7, g -> 0.4, f -> 0.1, B -> 0.25))

EDIT: removed the collect operation from the final map.
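The column-pairing and deduplication steps above use only plain collections, so they can be sanity-checked without Spark. A small pure-Scala check of the `grouped(2)` pairing and the `groupBy`/`head` deduplication, using the same example values (note the pattern is `Seq(k, v)` here because the input is a `Seq`, whereas `inputSmall.columns` is an `Array`):

```scala
val columns = Seq("column1", "transformedCol1", "column2", "transformedCol2")

// grouped(2) yields sliding, non-overlapping chunks of two column names
val columnPairs: List[(String, String)] = columns
  .grouped(2)
  .collect { case Seq(k, v) => (k, v) }
  .toList
println(columnPairs)
// List((column1,transformedCol1), (column2,transformedCol2))

// groupBy + head keeps one value per key, as the answer does per column
val pairs = Seq(("A", 0.3), ("d", 0.0), ("A", 0.3), ("c", 0.2))
val deduped: Map[String, Double] =
  pairs.groupBy(_._1).map { case (_, vs) => vs.head }
println(deduped)
```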

If you are using Spark 2+, here's a suggestion:

// requires: import org.apache.spark.sql.functions.map
//           import spark.implicits._ (for the $ column syntax)
val inputToMap = inputSmall.select(
  map($"column1", $"transformedCol1").as("column1"),
  map($"column2", $"transformedCol2").as("column2")
)

val cols = inputToMap.columns
val localData = inputToMap.collect

// build the nested Map[String, Map[String, Double]] locally
cols.map { colName =>
  colName -> localData.flatMap(_.getAs[Map[String, Double]](colName)).toMap
}.toMap
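The final step here relies only on plain collections: each collected row contributes one single-entry map per column, and flattening them into one map is ordinary Scala. A pure-Scala sketch of that merge (sample maps hard-coded from the example data; with `toMap`, a later duplicate key would overwrite an earlier one):

```scala
// What one column of the collected data looks like: one single-entry map per row
val localMaps: Array[Map[String, Double]] = Array(
  Map("A" -> 0.3), Map("d" -> 0.0), Map("A" -> 0.3), Map("c" -> 0.2)
)

// Flatten to key/value pairs, then collapse into one map
val merged: Map[String, Double] = localMaps.flatten.toMap
println(merged)
```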