Apache Spark: Is there a difference between an agg function and a window function on a Spark DataFrame?

I want to apply a sum to a column of a Spark DataFrame (Spark 2.1), and I have two ways to do it:

1 - With a window function:

val windowing = Window.partitionBy("id")
dataframe
.withColumn("sum", sum(col("column_1")) over windowing)

2 - With the agg function:

dataframe
.groupBy("id")
.agg(sum(col("column_1")).alias("sum"))

Which one is best in terms of performance? And what is the difference between the two approaches?

You can use an aggregate function inside a window (first case) or with a grouping (second case). The difference is that with a window, each row is associated with the aggregate computed over its entire window, whereas with a grouping, each group is associated with the aggregate computed over that group (a group of rows becomes a single row).

In your case, you would get this:

val dataframe = spark.range(6).withColumn("key", 'id % 2)
dataframe.show
+---+---+
| id|key|
+---+---+
|  0|  0|
|  1|  1|
|  2|  0|
|  3|  1|
|  4|  0|
|  5|  1|
+---+---+
Case 1: windowing

val windowing = Window.partitionBy("key")
dataframe.withColumn("sum", sum(col("id")) over windowing).show
+---+---+---+                                                                   
| id|key|sum|
+---+---+---+
|  0|  0|  6|
|  2|  0|  6|
|  4|  0|  6|
|  1|  1|  9|
|  3|  1|  9|
|  5|  1|  9|
+---+---+---+
Case 2: grouping

dataframe.groupBy("key").agg(sum('id)).show
+---+-------+
|key|sum(id)|
+---+-------+
|  0|      6|
|  1|      9|
+---+-------+
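If you want each row to carry its group total (as in case 1) but prefer to go through a grouped aggregation (as in case 2), a common alternative is to aggregate with groupBy and join the totals back onto the original rows. This is only a sketch on top of the example above, not part of the original answer; it assumes the same spark-shell session and the dataframe built earlier:

// Aggregate once per key, then join the result back onto every row.
// The output matches the window version (the per-key sum repeated on each row),
// while the aggregation itself goes through the grouped-aggregation path.
import org.apache.spark.sql.functions._
val totals = dataframe.groupBy("key").agg(sum("id").alias("sum"))
dataframe.join(totals, Seq("key")).show

Whether this actually beats the window depends on your data and on the join strategy Spark chooses, so check the plan with explain() before relying on it.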

As @Oli mentioned, aggregate functions can be used inside a window (first case) or together with a grouping (second case). In terms of performance, the "aggregate function with grouping" will be much faster than the "aggregate function with window". We can visualize this by analyzing the physical plans.

df.show
+---+----------+                                                                   
|  id|  expense|
+---+----------+
|   1|      100|
|   2|      300|
|   1|      100|
|   3|      200|
+---+----------+
1 - Aggregation with a window:

val window = Window.partitionBy("id")
df.withColumn("total_expense", sum(col("expense")) over window).show
+---+----------+-------------------+                                                     
| id|   expense|      total_expense|
+---+----------+-------------------+
|  3|       200|                200|
|  1|       100|                200|
|  1|       100|                200|
|  2|       300|                300|
+---+----------+-------------------+

df.withColumn("total_expense", sum(col("expense")) over window).explain
== Physical Plan ==
Window [sum(cast(expense#9 as bigint)) windowspecdefinition(id#8, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS total_expense#265L], [id#8]
+- *(2) Sort [id#8 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#8, 200), true, [id=#144]
      +- *(1) Project [_1#3 AS id#8, _2#4 AS expense#9]
         +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#3, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#4]
            +- Scan[obj#2]
2 - Aggregation with groupBy:

df.groupBy("id").agg(sum($"expense").alias("total_expense")).show
+---+------------------+                                                             
| id|     total_expense|
+---+------------------+
|  3|               200|
|  1|               200|
|  2|               300|
+---+------------------+

df.groupBy("id").agg(sum($"expense").alias("total_expense")).explain()
    == Physical Plan ==
    *(2) HashAggregate(keys=[id#8], functions=[sum(cast(expense#9 as bigint))])
    +- Exchange hashpartitioning(id#8, 200), true, [id=#44]
       +- *(1) HashAggregate(keys=[id#8], functions=[partial_sum(cast(expense#9 as bigint))])
          +- *(1) Project [_1#3 AS id#8, _2#4 AS expense#9]
             +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#3, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#4]
                +- Scan[obj#2]
From the execution plans we can see that in the window case there is a full shuffle followed by a sort, whereas in the groupBy case the shuffle is reduced (a partial sum is computed locally before the data is shuffled).
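To reproduce the comparison end to end, here is a minimal self-contained sketch for spark-shell (the data and column names mirror the example above; the exact operator names and plan ids will differ depending on your Spark version):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(("1", 100), ("2", 300), ("1", 100), ("3", 200)).toDF("id", "expense")

// Window aggregation: keeps every row and attaches the per-id total,
// at the cost of a full shuffle followed by a sort within each partition.
df.withColumn("total_expense", sum(col("expense")) over Window.partitionBy("id")).explain()

// GroupBy aggregation: partial sums are computed before the shuffle,
// so less data is moved and no sort is needed.
df.groupBy("id").agg(sum(col("expense")).alias("total_expense")).explain()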