Spark: DataFrame aggregation (Scala)

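I have the following requirement for aggregating data in a Spark DataFrame in Scala. I have two datasets.

Dataset 1 contains the values (val1, val2, ...) for each "t" type, spread across several different columns (t1, t2, ...).

Dataset 2 represents the same thing with a separate row for each "t" type.

Now I need to group by the (id, t, t*) fields and print the totals sum(val) and sum(val*) as separate records. The two totals should be equal.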

My output should look like below:
+---+---+--------+---+---------+
|id1| t |sum(val)| t*|sum(val*)|
+---+---+--------+---+---------+
|  1|111|     200|111|      200|
|  1|221|     100|221|      100|
|  1|331|    1000|331|     1000|
|  2|112|     400|112|      400|
|  2|222|     500|222|      500|
|  2|332|    1000|332|     1000|
|  3|113|     600|113|      600|
|  3|223|    1000|223|     1000|
|  3|333|    1000|333|     1000|
+---+---+--------+---+---------+

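I am thinking of exploding dataset 1 into multiple records, one per "t" type, and then joining it with dataset 2.

But could you suggest a better approach that will not hurt performance as the datasets grow?

The simplest solution is to do one sub-select per column pair and then union the results: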

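import org.apache.spark.sql.functions.col

val ts = Seq(1, 2, 3)
// Project each (t<n>, val<n>) column pair to a common (id1, t, val) shape
val dfs = ts.map(t => data1.select(col("id1"), col("t" + t) as "t", col("val" + t) as "val"))
val unioned = dfs.reduce(_ union _)

val ds = unioned.join(data2, col("t") === col("t*"))
// aggregate here (groupBy + sum)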
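
As a rough sketch of how the missing aggregation step could be filled in, the version below sums each side before joining, so that a "t" value appearing several times on both sides does not multiply rows in the join. Everything about the schemas is an assumption: data1 is taken to have columns id1, t1, val1, t2, val2, t3, val3, and data2 to have only `t*` and `val*`. The object name, the helper names, the output aliases sum_val and sum_val_star, and the sample rows are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, sum}

object UnionThenAggregate {
  // Reshape the wide dataset into (id1, t, val) rows: one projection per
  // column pair, then a union of all projections.
  def toLong(wide: DataFrame, suffixes: Seq[Int]): DataFrame =
    suffixes
      .map(i => wide.select(col("id1"), col(s"t$i") as "t", col(s"val$i") as "val"))
      .reduce(_ union _)

  // Aggregate each side separately, then join the two (much smaller) results.
  def balances(wide: DataFrame, narrow: DataFrame, suffixes: Seq[Int]): DataFrame = {
    val left = toLong(wide, suffixes)
      .groupBy(col("id1"), col("t"))
      .agg(sum(col("val")) as "sum_val")
    val right = narrow
      .groupBy(col("t*"))
      .agg(sum(col("val*")) as "sum_val_star")
    left.join(right, col("t") === col("t*"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("balances").getOrCreate()
    import spark.implicits._

    // Hypothetical sample rows, assumed schema (see the note above).
    val data1 = Seq(
      (1, 111, 100, 221, 50, 331, 500),
      (1, 111, 100, 221, 50, 331, 500)
    ).toDF("id1", "t1", "val1", "t2", "val2", "t3", "val3")
    val data2 = Seq((111, 150), (111, 50), (221, 100), (331, 1000)).toDF("t*", "val*")

    balances(data1, data2, Seq(1, 2, 3)).orderBy("t").show()
    spark.stop()
  }
}
```

Whether pre-aggregating is actually faster depends on the data; it mainly helps when each "t" value has many rows, because the join then only sees one row per key on each side.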

You can also try an array with explode:

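import org.apache.spark.sql.functions.{array, explode, when}
import spark.implicits._  // enables the 'colName symbol syntax

// One output row per "t" column; val1..val3 are kept so they can be picked below
val df1 = data1.withColumn("colList", array('t1, 't2, 't3))
               .withColumn("t", explode('colList))

// Pick the value that belongs to the "t" column this row came from
val ds = df1.withColumn("val",
          when('t === 't1, 'val1)
          .when('t === 't2, 'val2)
          .when('t === 't3, 'val3)
          .otherwise(0))
         .select('id1, 't, 'val)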

The last step is to join this dataset with data2:

import org.apache.spark.sql.functions.{col, first, sum}

ds.join(data2, 't === col("t*"))
  .groupBy("t", "t*")
  .agg(first("id1") as "id1", sum("val"), sum("val*"))
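
One possible variation on the explode approach, sketched here under assumptions rather than taken from the answer above: explode an array of (t, val) structs instead of exploding the "t" columns alone and recovering the value with a when chain. The when chain returns val1 for every exploded row whenever t1 and t2 hold the same value in a row, while the struct keeps each value paired with its own "t". The column names id1, t1..t3 and val1..val3 are assumed from the question, the object and helper names are hypothetical, and the sample row is invented to show the duplicate-"t" case.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, explode, struct}

object ExplodePairs {
  // Turn (id1, t1, val1, t2, val2, ...) into one (id1, t, val) row per pair.
  def toLong(wide: DataFrame, suffixes: Seq[Int]): DataFrame = {
    // Each struct carries a "t" together with the value that belongs to it.
    val pairs = suffixes.map(i => struct(col(s"t$i") as "t", col(s"val$i") as "val"))
    wide
      .select(col("id1"), explode(array(pairs: _*)) as "pair")
      .select(col("id1"), col("pair.t") as "t", col("pair.val") as "val")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explode-pairs").getOrCreate()
    import spark.implicits._

    // Hypothetical row in which t1 and t2 share the value 111.
    val data1 = Seq((1, 111, 10, 111, 20, 331, 5))
      .toDF("id1", "t1", "val1", "t2", "val2", "t3", "val3")

    toLong(data1, Seq(1, 2, 3)).show()
    spark.stop()
  }
}
```

The result has the same (id1, t, val) shape as ds, so it can be joined with data2 and aggregated in the same way. All "t" columns need a common type, and likewise all val columns, since the array elements must share one struct type.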
