Arrays: How to sum the values of a struct in a nested array in a Spark DataFrame?
This is in Spark 2.1. Given this input file `order.json`:
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
and the following DataFrames:
import org.apache.spark.sql.functions.{collect_list, struct}

val order = sqlContext.read.json("order.json")
val df2 = order.select(struct("*") as 'order)          // wrap each row in a struct
val df3 = df2.groupBy("order.userId").agg(collect_list($"order").as("array"))
df3 has the following content:
+------+---------------------------+
|userId|                      array|
+------+---------------------------+
|     1|[[1,202.3,1], [2,343.99,1]]|
|     2|             [[3,399.99,2]]|
+------+---------------------------+
and this schema:
root
 |-- userId: long (nullable = true)
 |-- array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- userid: long (nullable = true)
Now suppose I was just given df3: how would I compute the sum of the prices inside each user's array, while keeping the rest of the structure?

Neither the DataFrame DSL nor SQL provides tools that can be applied directly to this task on arrays of arbitrary size, without first explode-ing the array.
You can use a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Sum the price field of every struct in the array column
val totalPrice = udf((xs: Seq[Row]) => xs.map(_.getAs[Double]("price")).sum)
df3.withColumn("totalPrice", totalPrice($"array"))
+------+--------------------+----------+
|userId|               array|totalPrice|
+------+--------------------+----------+
|     1|[[1,202.3,1], [2,...|    546.29|
|     2|      [[3,399.99,2]]|    399.99|
+------+--------------------+----------+
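Note that this UDF will throw a NullPointerException if the array column itself is null. A null-safe variant (my own addition, not part of the original answer) could look like this:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Treats a null array as empty and a null price as 0.0
val totalPriceSafe = udf { (xs: Seq[Row]) =>
  Option(xs).getOrElse(Seq.empty).map { r =>
    val i = r.fieldIndex("price")
    if (r.isNullAt(i)) 0.0 else r.getDouble(i)
  }.sum
}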
Or convert to a statically typed Dataset:
df3
  .as[(Long, Seq[(Long, Double, Long)])]               // (userId, Seq[(id, price, userid)])
  .map{ case (id, xs) => (id, xs, xs.map(_._2).sum) }  // _._2 is the price field
  .toDF("userId", "array", "totalPrice").show
+------+--------------------+----------+
|userId|               array|totalPrice|
+------+--------------------+----------+
|     1|[[1,202.3,1], [2,...|    546.29|
|     2|      [[3,399.99,2]]|    399.99|
+------+--------------------+----------+
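If you prefer named fields over tuple positions, the same approach works with a case class (a sketch; the Order class is my own naming, not from the original answer):

// Field names and order must match the struct schema shown above
case class Order(id: Long, price: Double, userid: Long)

df3
  .as[(Long, Seq[Order])]
  .map { case (id, xs) => (id, xs, xs.map(_.price).sum) }
  .toDF("userId", "array", "totalPrice")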
As mentioned above, you can explode and aggregate:
import org.apache.spark.sql.functions.{explode, first, sum}

df3
  .withColumn("price", explode($"array.price"))  // one row per array element
  .groupBy($"userId")
  .agg(sum($"price"), df3.columns.tail.map(c => first(c).alias(c)): _*)
+------+----------+--------------------+
|userId|sum(price)|               array|
+------+----------+--------------------+
|     1|    546.29|[[1,202.3,1], [2,...|
|     2|    399.99|      [[3,399.99,2]]|
+------+----------+--------------------+
But this is expensive here and doesn't use the existing structure.
You can also use an ugly trick:
import org.apache.spark.sql.functions.{coalesce, lit, max, size}

// Index into array.price up to the largest array size in the DataFrame,
// substituting 0.0 for missing positions, and add everything up
val totalPrice = (0 to df3.agg(max(size($"array"))).as[Int].first)
  .map(i => coalesce($"array.price".getItem(i), lit(0.0)))
  .foldLeft(lit(0.0))(_ + _)

df3.withColumn("totalPrice", totalPrice)
+------+--------------------+----------+
|userId|               array|totalPrice|
+------+--------------------+----------+
|     1|[[1,202.3,1], [2,...|    546.29|
|     2|      [[3,399.99,2]]|    399.99|
+------+--------------------+----------+
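Since the whole computation lives in a single Column expression, you can print it to verify what the fold built (a quick sanity check, not part of the original answer):

// Column.toString renders the underlying expression, roughly:
// ((0.0 + coalesce(array.price[0], 0.0)) + coalesce(array.price[1], 0.0))
println(totalPrice)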
But this is more of a curiosity than a real solution.

Spark 2.4.0 and later
You can now use the built-in aggregate higher-order function:
df3.createOrReplaceTempView("orders")
spark.sql(
  """
    |SELECT
    |  *,
    |  -- the CAST keeps the accumulator a DOUBLE, matching item.price
    |  AGGREGATE(`array`, CAST(0.0 AS DOUBLE), (accumulator, item) -> accumulator + item.price) AS totalPrice
    |FROM
    |  orders
    |""".stripMargin).show()