How do I sum the values of structs in a nested array in a Spark DataFrame?

arrays scala apache-spark

This is in Spark 2.1. Given this input file:

order.json

{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}

and the following DataFrames:

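import org.apache.spark.sql.functions.{collect_list, struct}
import sqlContext.implicits._  // enables the $"..." column syntax
val order = sqlContext.read.json("order.json")
val df2 = order.select(struct("*") as 'order)
val df3 = df2.groupBy("order.userId").agg(collect_list($"order").as("array"))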

df3 has the following content:

+------+---------------------------+
|userId|array                      |
+------+---------------------------+
|1     |[[1,202.3,1], [2,343.99,1]]|
|2     |[[3,399.99,2]]             |
+------+---------------------------+

and this schema:

root
 |-- userId: long (nullable = true)
 |-- array: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = true)
 |    |    |-- price: double (nullable = true)
 |    |    |-- userid: long (nullable = true)

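Now suppose I am given df3:

I would like to compute the sum of array.price for each userId, taking advantage of having the array per userId row.

I would add this computation as a new column in the resulting DataFrame, as if I had done df3.withColumn("sum", lit(0)), but with lit(0) replaced by my computation.

I expected this to be straightforward, but I am stuck on both steps. I have not found any way to access the array as a whole and perform the computation per row (for example with foldLeft).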

"I would like to compute the sum of array.price for each userId, taking advantage of having the array"

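Unfortunately, having an array works against you here. In Spark 2.1, neither Spark SQL nor the DataFrame DSL provides tools that can be used directly to handle this task on arrays of arbitrary size without decomposing the array first (explode).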

You can use a UDF:

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

val totalPrice = udf((xs: Seq[Row]) => xs.map(_.getAs[Double]("price")).sum)
df3.withColumn("totalPrice", totalPrice($"array"))

+------+--------------------+----------+
|userId|               array|totalPrice|
+------+--------------------+----------+
|     1|[[1,202.3,1], [2,...|    546.29|
|     2|      [[3,399.99,2]]|    399.99|
+------+--------------------+----------+
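
The schema marks the array, its elements, and price as nullable, while the UDF above assumes none of them is null. If that assumption might not hold for your data, a more tolerant variant could look like the sketch below. The names prices and totalPriceSafe are introduced here for illustration and are not part of the original answer; treating missing values as contributing nothing to the sum is also an assumption you may want to change.

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical helper: collects the non-null prices from one row's array of structs.
def prices(xs: Seq[Row]): Seq[Double] =
  Option(xs).getOrElse(Seq.empty)                       // a null array is treated as empty
    .filter(_ != null)                                  // skip null structs
    .filterNot(r => r.isNullAt(r.fieldIndex("price")))  // skip structs whose price is null
    .map(_.getAs[Double]("price"))

// Same shape as the UDF above, but built on the null-tolerant helper.
val totalPriceSafe = udf((xs: Seq[Row]) => prices(xs).sum)

// Usage, assuming df3 from the question is in scope:
// df3.withColumn("totalPrice", totalPriceSafe(df3("array")))
```

With this variant a row whose array is null or empty gets a totalPrice of 0.0 rather than failing the job.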

or convert to a statically typed Dataset:

df3
  .as[(Long, Seq[(Long, Double, Long)])]
  .map{ case (id, xs) => (id, xs, xs.map(_._2).sum) }
  .toDF("userId", "array", "totalPrice").show

+------+--------------------+----------+
|userId|               array|totalPrice|
+------+--------------------+----------+
|     1|[[1,202.3,1], [2,...|    546.29|
|     2|      [[3,399.99,2]]|    399.99|
+------+--------------------+----------+

As mentioned above, you can explode and aggregate:

import org.apache.spark.sql.functions.{explode, first, sum}

df3
  .withColumn("price", explode($"array.price"))
  .groupBy($"userId")
  .agg(sum($"price"), df3.columns.tail.map(c => first(c).alias(c)): _*)

+------+----------+--------------------+
|userId|sum(price)|               array|
+------+----------+--------------------+
|     1|    546.29|[[1,202.3,1], [2,...|
|     2|    399.99|      [[3,399.99,2]]|
+------+----------+--------------------+

but it is expensive and does not use the existing structure.

There is an ugly trick you could use:

import org.apache.spark.sql.functions.{coalesce, lit, max, size}

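// One term per index up to the largest array size; shorter arrays yield null
// for out-of-range indices, which coalesce turns into 0.0.
val totalPrice = (0 until df3.agg(max(size($"array"))).as[Int].first)
  .map(i => coalesce($"array.price".getItem(i), lit(0.0)))
  .foldLeft(lit(0.0))(_ + _)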

df3.withColumn("totalPrice", totalPrice)

+------+--------------------+----------+
|userId|               array|totalPrice|
+------+--------------------+----------+
|     1|[[1,202.3,1], [2,...|    546.29|
|     2|      [[3,399.99,2]]|    399.99|
+------+--------------------+----------+

but this is more of a curiosity than a real solution.
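
If you are on Spark 2.4 or later (newer than the 2.1 release the question targets), the SQL higher-order function aggregate can fold an array in place, with no UDF and no explode. The following is a minimal sketch under that version assumption, with df3 from the question in scope; withTotal is a name introduced here for illustration.

```scala
import org.apache.spark.sql.functions.expr

// Assumes Spark 2.4+ and df3 from the question.
// `array`.price projects the price field out of every struct (array<double>);
// aggregate then folds that array, starting from a double zero.
val withTotal = df3.withColumn(
  "totalPrice",
  expr("aggregate(`array`.price, CAST(0 AS DOUBLE), (acc, x) -> acc + x)")
)
```

The start value is cast explicitly so that the accumulator and the array elements share the same type. Note that a null price inside the array makes the whole sum null, since adding null yields null in Spark SQL.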