Scala: how to group on struct elements and convert them back to a struct with the same schema

Spark 2.4.5. In my dataframe I have arrays of structs that hold snapshots of a field taken from time to time.

Now I'm looking for a way to keep only the snapshots taken when the data actually changed.

My schema is as follows:

root 
 |-- fee: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- updated_at: long (nullable = true)
 |    |    |-- fee: float (nullable = true)
 |-- status: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- updated_at: long (nullable = true)
 |    |    |-- status: string (nullable = true)
Current output:

+------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
|fee                                                                     |status                                                                                                                               |
+------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
|[[1584579671000, 12.11], [1584579672000, 12.11], [1584579673000, 12.11]]|[[1584579671000, Closed-A], [1584579672000, Closed-A], [1584579673000, Closed-B], [1584579674000, Closed], [1584579675000, Closed-A]]|
+------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------+
Since the fee column never changed, it should end up with just one entry. Since the status changed several times, the output should be [[1584579671000, Closed-A], [1584579673000, Closed-B], [1584579674000, Closed], [1584579675000, Closed-A]]. Note that the status Closed-A appears twice here.

Trying to get the following output:

+------------------------+----------------------------------------------------------------------------------------------------------+
|fee                     |status                                                                                                    |
+------------------------+----------------------------------------------------------------------------------------------------------+
|[[1584579671000, 12.11]]|[[1584579671000, Closed-A], [1584579673000, Closed-B], [1584579674000, Closed], [1584579675000, Closed-A]]|
+------------------------+----------------------------------------------------------------------------------------------------------+

Note: trying to do this without user-defined functions.
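
To reproduce this in a spark-shell session, a minimal dataframe matching the schema above could be built along these lines (the FeeSnap and StatusSnap case class names are illustrative, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

// element structs matching the schema above
case class FeeSnap(updated_at: Long, fee: Float)
case class StatusSnap(updated_at: Long, status: String)

// one row holding the snapshot arrays from the sample output
val df = Seq((
  Seq(FeeSnap(1584579671000L, 12.11f), FeeSnap(1584579672000L, 12.11f),
      FeeSnap(1584579673000L, 12.11f)),
  Seq(StatusSnap(1584579671000L, "Closed-A"), StatusSnap(1584579672000L, "Closed-A"),
      StatusSnap(1584579673000L, "Closed-B"), StatusSnap(1584579674000L, "Closed"),
      StatusSnap(1584579675000L, "Closed-A"))
)).toDF("fee", "status")

df.printSchema()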

Using the Spark DataFrame API, the above problem can be solved as follows: add a monotonically increasing id to uniquely identify each record; explode and flatten the dataframe; group by fee and by status separately (as required); aggregate the grouped dataframes by id to collect the structs; join the two dataframes back on id; the id can then be dropped from the final dataframe:

import org.apache.spark.sql.functions.{col, collect_list, explode, monotonically_increasing_id, struct}

// tag each input record with a unique id
val idDF = df.withColumn("id", monotonically_increasing_id())

// explode both arrays, producing one row per (fee, status) combination
val explodeDf = idDF
  .select(col("id"), col("status"), explode(col("fee")).as("fee"))
  .select(col("id"), col("fee"), explode(col("status")).as("status"))

// flatten the struct fields into plain columns
val flatDF = explodeDf.select(
  col("id"),
  col("fee.fee"),
  col("fee.updated_at").as("updated_at_fee"),
  col("status.status"),
  col("status.updated_at").as("updated_at_status"))

// one entry per distinct fee value with its earliest timestamp,
// collected back into an array of structs in the original field order
val feeDF = flatDF.groupBy("id", "fee").min("updated_at_fee")
val feeSelectDF = feeDF.select(col("id"), col("fee"), col("min(updated_at_fee)").as("updated_at"))
val feeAggDF = feeSelectDF.groupBy("id").agg(collect_list(struct("updated_at", "fee")).as("fee"))


// same for status
val statusDF = flatDF.groupBy("id", "status").min("updated_at_status")
val statusSelectDF = statusDF.select(col("id"), col("status"), col("min(updated_at_status)").as("updated_at"))
val statusAggDF = statusSelectDF.groupBy("id").agg(collect_list(struct("updated_at", "status")).as("status"))

// join the two aggregates back together on id
val finalDF = feeAggDF.join(statusAggDF, "id")
finalDF.show(10)
finalDF.printSchema()
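
Per the last step above, the helper id is only needed for the join and can be dropped afterwards to get back to the original two-column shape:

// id has served its purpose as a join key; drop it
val resultDF = finalDF.drop("id")
resultDF.show(10, false)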

@jxc Spark version 2.4.5.
In your input Closed-A appears 3 times; what about the value [1584579672000, Closed-A]?
Thanks for the help, but the group by does not seem to help. For the entry [[1584579671000, Closed-A], [1584579673000, Closed-A], [1584579674000, Closed], [1584579675000, Closed-A]], the sample output needs to be [[1584579671000, Closed-A], [1584579674000, Closed-A]]
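
As the comments point out, grouping by value collapses re-occurrences, whereas a value that reappears after a change (Closed-A after Closed) should be kept. A rough sketch of an alternative without a UDF, using Spark 2.4 higher-order functions and assuming the struct<updated_at, status> layout from the schema above:

import org.apache.spark.sql.functions.expr

// sort the snapshots by updated_at (the first struct field), then fold left,
// appending an element only when its status differs from the last one kept
val dedupDF = df.withColumn("status", expr("""
  aggregate(
    array_sort(status),
    cast(array() as array<struct<updated_at: bigint, status: string>>),
    (acc, x) -> if(size(acc) = 0 or element_at(acc, -1).status != x.status,
                   concat(acc, array(x)),
                   acc)
  )
"""))

The same expression with fee: float in place of status: string handles the fee column, and the element structs keep their original schema throughout.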