Scala Spark 2.0 DataFrame: collect multiple rows into one column by key
I have a DataFrame like the one below. When the key column values are the same, I want to collect the multiple rows into a single array/struct column.
val data = Seq(("a","b","sum",0),("a","b","avg",2)).toDF("id1","id2","type","value2")
data.show
+---+---+----+------+
|id1|id2|type|value2|
+---+---+----+------+
|  a|  b| sum|     0|
|  a|  b| avg|     2|
+---+---+----+------+
I want to convert it into the following:
+---+---+----+------+
|id1|id2|agg |value2|
+---+---+----+------+
|  a|  b| 0,2|     0|
+---+---+----+------+
printSchema should look like this:
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = true)
 |    |-- sum: int (nullable = true)
 |    |-- dc: int (nullable = true)
You can do:
import org.apache.spark.sql.functions._
// In a standalone app you would also need: import spark.implicits._

val data = Seq(
  ("a","b","sum",0), ("a","b","avg",2)
).toDF("id1","id2","type","value2")

// Group by the key columns; for each type, pick its value with a
// conditional `when` plus `first(..., ignoreNulls = true)`, then pack
// the two results into a single struct column named "agg".
val result = data.groupBy($"id1", $"id2").agg(struct(
  first(when($"type" === "sum", $"value2"), true).alias("sum"),
  first(when($"type" === "avg", $"value2"), true).alias("avg")
).alias("agg"))
result.show
+---+---+-----+
|id1|id2| agg|
+---+---+-----+
|  a|  b|[0,2]|
+---+---+-----+
result.printSchema
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = false)
 |    |-- sum: integer (nullable = true)
 |    |-- avg: integer (nullable = true)
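An alternative, not from the original answer but a minimal sketch assuming the same data DataFrame: pivot the type column (available since Spark 1.6) and then pack the pivoted columns into a struct. Listing the pivot values explicitly avoids an extra pass over the data to discover them:
// Hypothetical alternative: pivot "type" into one column per value,
// then combine the pivoted columns into the "agg" struct.
val pivoted = data
  .groupBy($"id1", $"id2")
  .pivot("type", Seq("sum", "avg"))
  .agg(first($"value2"))
  .select($"id1", $"id2", struct($"sum", $"avg").alias("agg"))

pivoted.show
// +---+---+-----+
// |id1|id2|  agg|
// +---+---+-----+
// |  a|  b|[0,2]|
// +---+---+-----+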
What is value2 in the second table? I mean, its value there is 0?