Scala Spark 2.0 DataFrame: collect multiple rows into one column by key
I have a DataFrame like the one below. When the key column values are the same, I want to collect the multiple rows into a single array/struct column.
val data = Seq(("a","b","sum",0),("a","b","avg",2)).toDF("id1","id2","type","value2")
data.show
+---+---+----+------+
|id1|id2|type|value2|
+---+---+----+------+
|  a|  b| sum|     0|
|  a|  b| avg|     2|
+---+---+----+------+
I want to convert it into the following:
+---+---+----+------+
|id1|id2|agg |value2|
+---+---+----+------+
|  a|  b| 0,2|     0|
+---+---+----+------+
printSchema should look like this:
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = true)
 |    |-- sum: int (nullable = true)
 |    |-- dc: int (nullable = true)
You can do:
import org.apache.spark.sql.functions._
// In a standalone app you would also need: import spark.implicits._

val data = Seq(
  ("a","b","sum",0), ("a","b","avg",2)
).toDF("id1","id2","type","value2")

// Group by the key columns; for each type, pick its value with a
// conditional `when` plus `first(..., ignoreNulls = true)`, then pack
// the two results into a single struct column named "agg".
val result = data.groupBy($"id1", $"id2").agg(struct(
  first(when($"type" === "sum", $"value2"), true).alias("sum"),
  first(when($"type" === "avg", $"value2"), true).alias("avg")
).alias("agg"))
result.show
+---+---+-----+
|id1|id2| agg|
+---+---+-----+
|  a|  b|[0,2]|
+---+---+-----+
result.printSchema
root
 |-- id1: string (nullable = true)
 |-- id2: string (nullable = true)
 |-- agg: struct (nullable = false)
 |    |-- sum: integer (nullable = true)
 |    |-- avg: integer (nullable = true)
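An alternative, not from the original answer but a minimal sketch assuming the same data DataFrame: pivot the type column (available since Spark 1.6) and then pack the pivoted columns into a struct. Listing the pivot values explicitly avoids an extra pass over the data to discover them:
// Hypothetical alternative: pivot "type" into one column per value,
// then combine the pivoted columns into the "agg" struct.
val pivoted = data
  .groupBy($"id1", $"id2")
  .pivot("type", Seq("sum", "avg"))
  .agg(first($"value2"))
  .select($"id1", $"id2", struct($"sum", $"avg").alias("agg"))

pivoted.show
// +---+---+-----+
// |id1|id2|  agg|
// +---+---+-----+
// |  a|  b|[0,2]|
// +---+---+-----+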
What is value2 in the second table? I mean, its value there is 0?