Apache Spark: Is there a difference between an agg function and a window function on a Spark DataFrame?

I want to apply a sum to a column of a Spark DataFrame (Spark 2.1), and I have two ways to do it:

1 - With a window function:

val windowing = Window.partitionBy("id")
dataframe
.withColumn("sum", sum(col("column_1")) over windowing)

2 - With the agg function:

dataframe
.groupBy("id")
.agg(sum(col("column_1")).alias("sum"))

Which one is best in terms of performance? And what is the difference between the two approaches?

You can use an aggregate function inside a window (first case) or with a grouping (second case). The difference is that with a window, each row is associated with the aggregate computed over its entire window, whereas with a grouping, each group is associated with the aggregate computed over that group (a group of rows becomes a single row).

In your case, you would get this:

val dataframe = spark.range(6).withColumn("key", 'id % 2)
dataframe.show
+---+---+
| id|key|
+---+---+
|  0|  0|
|  1|  1|
|  2|  0|
|  3|  1|
|  4|  0|
|  5|  1|
+---+---+
Case 1: windowing

val windowing = Window.partitionBy("key")
dataframe.withColumn("sum", sum(col("id")) over windowing).show
+---+---+---+                                                                   
| id|key|sum|
+---+---+---+
|  0|  0|  6|
|  2|  0|  6|
|  4|  0|  6|
|  1|  1|  9|
|  3|  1|  9|
|  5|  1|  9|
+---+---+---+
Case 2: grouping

dataframe.groupBy("key").agg(sum('id)).show
+---+-------+
|key|sum(id)|
+---+-------+
|  0|      6|
|  1|      9|
+---+-------+
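If you want each row to carry its group total (as in case 1) but prefer to go through a grouped aggregation (as in case 2), a common alternative is to aggregate with groupBy and join the totals back onto the original rows. This is only a sketch on top of the example above, not part of the original answer; it assumes the same spark-shell session and the dataframe built earlier:

// Aggregate once per key, then join the result back onto every row.
// The output matches the window version (the per-key sum repeated on each row),
// while the aggregation itself goes through the grouped-aggregation path.
import org.apache.spark.sql.functions._
val totals = dataframe.groupBy("key").agg(sum("id").alias("sum"))
dataframe.join(totals, Seq("key")).show

Whether this actually beats the window depends on your data and on the join strategy Spark chooses, so check the plan with explain() before relying on it.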

As @Oli mentioned, aggregate functions can be used inside a window (first case) or together with a grouping (second case). In terms of performance, the "aggregate function with grouping" will be much faster than the "aggregate function with window". We can visualize this by analyzing the physical plans.

df.show
+---+----------+                                                                   
|  id|  expense|
+---+----------+
|   1|      100|
|   2|      300|
|   1|      100|
|   3|      200|
+---+----------+
1 - Aggregation with a window:

val window = Window.partitionBy("id")
df.withColumn("total_expense", sum(col("expense")) over window).show
+---+----------+-------------------+                                                     
| id|   expense|      total_expense|
+---+----------+-------------------+
|  3|       200|                200|
|  1|       100|                200|
|  1|       100|                200|
|  2|       300|                300|
+---+----------+-------------------+

df.withColumn("total_expense", sum(col("expense")) over window).explain
== Physical Plan ==
Window [sum(cast(expense#9 as bigint)) windowspecdefinition(id#8, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS total_expense#265L], [id#8]
+- *(2) Sort [id#8 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#8, 200), true, [id=#144]
      +- *(1) Project [_1#3 AS id#8, _2#4 AS expense#9]
         +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#3, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#4]
            +- Scan[obj#2]
2 - Aggregation with groupBy:

df.groupBy("id").agg(sum($"expense").alias("total_expense")).show
+---+------------------+                                                             
| id|     total_expense|
+---+------------------+
|  3|               200|
|  1|               200|
|  2|               300|
+---+------------------+

df.groupBy("id").agg(sum($"expense").alias("total_expense")).explain()
    == Physical Plan ==
    *(2) HashAggregate(keys=[id#8], functions=[sum(cast(expense#9 as bigint))])
    +- Exchange hashpartitioning(id#8, 200), true, [id=#44]
       +- *(1) HashAggregate(keys=[id#8], functions=[partial_sum(cast(expense#9 as bigint))])
          +- *(1) Project [_1#3 AS id#8, _2#4 AS expense#9]
             +- *(1) SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#3, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2 AS _2#4]
                +- Scan[obj#2]
From the execution plans we can see that in the window case there is a full shuffle followed by a sort, whereas in the groupBy case the shuffle is reduced (a partial sum is computed locally before the data is shuffled).
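To reproduce the comparison end to end, here is a minimal self-contained sketch for spark-shell (the data and column names mirror the example above; the exact operator names and plan ids will differ depending on your Spark version):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val df = Seq(("1", 100), ("2", 300), ("1", 100), ("3", 200)).toDF("id", "expense")

// Window aggregation: keeps every row and attaches the per-id total,
// at the cost of a full shuffle followed by a sort within each partition.
df.withColumn("total_expense", sum(col("expense")) over Window.partitionBy("id")).explain()

// GroupBy aggregation: partial sums are computed before the shuffle,
// so less data is moved and no sort is needed.
df.groupBy("id").agg(sum(col("expense")).alias("total_expense")).explain()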