Apache Spark: group by consecutive identical values of one column, taking the max or min of another column for each group
Suppose I have a DataFrame like the following:
+-------------------+------+------------+
| Date| Val| Condition|
+-------------------+------+------------+
|2020-10-02 10:00:00|211.39| Max|
|2020-10-02 10:10:00|210.94| Min|
|2020-10-02 10:30:00|209.21| Max|
|2020-10-02 11:20:00|207.48| Min|
|2020-10-02 11:50:00|207.22| Min| <- take only this row because it's less than 207.48
|2020-10-02 12:10:00|207.58| Max|
|2020-10-02 12:40:00|207.45| Min|
|2020-10-02 13:10:00|207.45| Min| <- take either row because they are equal
|2020-10-02 13:40:00| 208.7| Max| <- take only this row because it's greater than 208.31
|2020-10-02 14:10:00|208.31| Max|
|2020-10-02 14:20:00|208.16| Min|
|2020-10-02 14:30:00| 208.3| Max|
|2020-10-02 14:50:00|208.25| Min|
|2020-10-02 15:10:00| 208.7| Max|
|2020-10-02 15:30:00|208.08| Min|
|2020-10-02 16:00:00| 208.0| Min| <- take only this row because it's less than 208.08
|2020-10-02 16:30:00|208.35| Max|
|2020-10-02 16:40:00|208.26| Min|
|2020-10-02 16:50:00|208.27| Max|
|2020-10-02 17:30:00|208.06| Min|
+-------------------+------+------------+
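The goal is:
- For each group of multiple consecutive rows with Condition = Max or Condition = Min,
- take only one row from the group. Which row is decided by the value of Condition: the row with the maximum or the minimum of column Val.

Try this: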
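import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// The $"..." syntax needs spark.implicits._ (already in scope in spark-shell).

val wind = Window.orderBy("Date")
// If the next (or previous) row has the same Condition, take the min/max of Val over that pair of rows.
val df1 = df.withColumn("val1",
  when($"Condition" === lead($"Condition", 1).over(wind),
    when($"Condition" === "Min", min($"val").over(wind.rowsBetween(0, 1))).otherwise(max($"val").over(wind.rowsBetween(0, 1))))
  .when($"Condition" === lag($"Condition", 1).over(wind),
    when($"Condition" === "Min", min($"val").over(wind.rowsBetween(-1, 0))).otherwise(max($"val").over(wind.rowsBetween(-1, 0))))
  .otherwise($"val"))
// Mark the second row of each pair with rn = 2 and keep only the rows with rn = 1.
val df2 = df1.withColumn("rn", when($"Condition" === lead($"Condition", 1).over(wind), 1)
    .when($"Condition" === lag($"Condition", 1).over(wind), 2)
    .otherwise(1))
  .withColumn("Val", $"val1").filter($"rn" === 1).drop("rn", "val1")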
df2.show(false)
+-------------------+------+---------+
|Date |Val |Condition|
+-------------------+------+---------+
|2020-10-02 10:00:00|211.39|Max |
|2020-10-02 10:10:00|210.94|Min |
|2020-10-02 10:30:00|209.21|Max |
|2020-10-02 11:20:00|207.22|Min |
|2020-10-02 12:10:00|207.58|Max |
|2020-10-02 12:40:00|207.45|Min |
|2020-10-02 13:40:00|208.7 |Max |
|2020-10-02 14:20:00|208.16|Min |
|2020-10-02 14:30:00|208.3 |Max |
|2020-10-02 14:50:00|208.25|Min |
|2020-10-02 15:10:00|208.7 |Max |
|2020-10-02 15:30:00|208.0 |Min |
|2020-10-02 16:30:00|208.35|Max |
|2020-10-02 16:40:00|208.26|Min |
|2020-10-02 16:50:00|208.27|Max |
|2020-10-02 17:30:00|208.06|Min |
+-------------------+------+---------+
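Let me know if this helps.

I came up with the following solution. It is not optimized and could probably be improved, but it seems to give the correct result: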
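import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// `extrema` is the input DataFrame shown in the question.
val w1 = Window.orderBy("Date")
val w2 = Window.orderBy("Date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
val w3 = Window.partitionBy("sum").orderBy(lit(1))

// Flag every row whose Condition differs from the previous row, then take a
// running sum of the flags so each run of equal Conditions shares one id.
val grpExtrema = extrema
  .withColumn("dupe", when(lag("Condition", 1).over(w1) =!= extrema("Condition"), 1).otherwise(0))
  .withColumn("sum", sum("dupe").over(w2))
  .drop("dupe")

// Within each run, replace Val by the run's min or max and keep one row.
grpExtrema
  .withColumn("row", row_number().over(w3))
  .withColumn("Val",
    when($"Condition" === lit("Min"), min("Val").over(w3))
      .otherwise(max("Val").over(w3)))
  .where(col("row") === 1).select("Date", "Val", "Condition")
  .show()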
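This problem can be solved by preparing an extra column that holds a group number for consecutive rows with the same Condition (the "group" column in the query below):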
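import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val numOfPartitions: Int = ??? // value left unspecified in the original answer; see the note below
val window = Window.orderBy("Date")

df
  // 1 when Condition differs from the previous row, 0 otherwise
  .withColumn("condition_change", when(col("Condition") === lag("Condition", 1, false).over(window), 0).otherwise(1))
  // the running sum of the change flags is the group number
  .withColumn("group", sum("condition_change").over(window))
  .drop("condition_change")
  .repartition(numOfPartitions)
  .groupBy("group")
  .agg(
    // structs compare field by field, so min/max order by Val first and carry Date along
    min(struct("Val", "Date")) as "min",
    max(struct("Val", "Date")) as "max",
    first("Condition") as "Condition")
  .withColumn("result", when(col("Condition") === "Min", col("min")).otherwise(col("max")))
  .select(col("result.Date") as "Date", col("result.Val") as "Val", col("Condition"))
  .show()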
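Note that you must set numOfPartitions for the repartition (otherwise the job will run on a single executor). Choose a value that matches the amount of data you have; a first attempt could be the value of "spark.sql.shuffle.partitions".

You could try grouping by the row number divided by 2 (rounded).

@mad programmer I'm a bit confused by the output DF. Can you elaborate?

@SathiyanS I edited the question to try to make it less confusing.

The way the problem is posed, it cannot be parallelized. Window.orderBy("Date") may give the expected result, but it will use only one core and will not scale to larger data.

Thanks, that was helpful. Meanwhile, I managed to find another solution that also seems to work; see my own answer. Since I'm new to Spark, it would be great if you could comment on it in case you spot any problems.

I found that both your solution and mine pick the wrong Date (e.g., for the row with Val 207.22 it should be 2020-10-02 11:50:00, not 2020-10-02 11:20:00).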
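To make the Date issue from the last comment concrete, here is a minimal plain-Scala sketch (no Spark involved) of the same idea: split the date-ordered rows into runs of equal Condition, then keep the whole row that holds each run's extreme Val, so Date and Val always come from the same record. The names `Reading`, `CollapseRunsDemo`, `collapseRuns` and `sample` are hypothetical and only serve the illustration; the sample rows are copied from the question's table.

```scala
// One input row; field names are illustrative, not from the original post.
final case class Reading(date: String, value: Double, condition: String)

object CollapseRunsDemo {
  // Assumes `rows` is already sorted by date.
  def collapseRuns(rows: Seq[Reading]): Seq[Reading] = {
    // Split the rows into runs of consecutive equal Condition values.
    val runs = rows.foldLeft(Vector.empty[Vector[Reading]]) { (acc, r) =>
      acc.lastOption match {
        case Some(run) if run.head.condition == r.condition =>
          acc.init :+ (run :+ r) // same Condition: extend the current run
        case _ =>
          acc :+ Vector(r)       // Condition changed: start a new run
      }
    }
    // Keep the entire extreme row rather than only its Val.
    runs.map { run =>
      if (run.head.condition == "Min") run.minBy(_.value) else run.maxBy(_.value)
    }
  }

  // A few rows taken from the table in the question.
  val sample: Seq[Reading] = Seq(
    Reading("2020-10-02 10:30:00", 209.21, "Max"),
    Reading("2020-10-02 11:20:00", 207.48, "Min"),
    Reading("2020-10-02 11:50:00", 207.22, "Min"),
    Reading("2020-10-02 12:10:00", 207.58, "Max")
  )

  def main(args: Array[String]): Unit =
    collapseRuns(sample).foreach(println)
}
```

With these sample rows, the Min run collapses to the 207.22 row together with its own Date, 2020-10-02 11:50:00. Taking min/max over a struct of (Val, Date), as in the groupBy answer above, is the Spark counterpart of keeping the whole row.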