Spark conditional lag function over a window in Scala
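I have a dataframe in which a value, label, is associated with each (id, bin, date, hour).
I would like to append several columns to the dataframe holding the label from the same hour on the previous day, from the previous hour on the previous day, and so on. I know how to get the first of these with the lag function: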
val dateWindow = Window.partitionBy($"id", $"bin").orderBy($"hour", $"date")
val expandedDf = data.withColumn("yesterdaySameHour", lag($"label", 1, 0.0).over(dateWindow))
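However, I don't know how to get the value of label at hour-1 on the previous day. Is there a way to build a conditional lag that filters out rows whose hour is greater than or equal to the current row's hour? If not, what is the right way to do this?
Thanks a lot.
You have to define the window according to what you need. Here you probably need to apply the lag function twice: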
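import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
// dW: rows sharing id, bin and hour, ordered by date, so lag 1 is the previous date's row
val dW = Window.partitionBy("id", "bin", "hour").orderBy("date")
// hW: rows sharing id, bin and date, ordered by hour, so lag 1 is the previous hour's row
val hW = Window.partitionBy("id", "bin", "date").orderBy("hour")
df.withColumn("yesterdaySameHour", lag("label", 1, 0.0).over(dW))
  .withColumn("todayPreviousHour", lag("label", 1, 0.0).over(hW))
  // lag over dW, then lag over hW: the previous hour of the previous date
  .withColumn("yestedayPreviousHour", lag(lag("label", 1, 0.0).over(dW), 1, 0.0).over(hW))
  .orderBy("date", "hour", "bin")
  .show(false)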
This gives you the following result:
+----------+----+---+---+-----+-----------------+-----------------+--------------------+
|date |hour|id |bin|label|yesterdaySameHour|todayPreviousHour|yestedayPreviousHour|
+----------+----+---+---+-----+-----------------+-----------------+--------------------+
|2019_12_19|7 |1 |0 |-1 |0 |0 |0 |
|2019_12_19|7 |1 |2 |-2 |0 |0 |0 |
|2019_12_19|7 |1 |3 |-3 |0 |0 |0 |
|2019_12_19|8 |1 |0 |1 |0 |-1 |0 |
|2019_12_19|8 |1 |2 |2 |0 |-2 |0 |
|2019_12_19|8 |1 |3 |3 |0 |-3 |0 |
|2019_12_20|7 |1 |0 |4 |-1 |0 |0 |
|2019_12_20|7 |1 |2 |5 |-2 |0 |0 |
|2019_12_20|7 |1 |3 |6 |-3 |0 |0 |
|2019_12_20|8 |1 |0 |7 |1 |4 |-1 |
|2019_12_20|8 |1 |2 |8 |2 |5 |-2 |
|2019_12_20|8 |1 |3 |9 |3 |6 |-3 |
+----------+----+---+---+-----+-----------------+-----------------+--------------------+
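One caveat: lag counts rows, not calendar days or hours, so the windows above only line up with "yesterday" and "the previous hour" when no date or hour is missing for a given (id, bin). If gaps are possible, one alternative is a range-based window over an absolute hour index. The following is only a sketch under assumptions: date is taken to be a string such as 2019_12_19, hour an integer from 0 to 23, and the hourIndex column and atOffset helper are hypothetical names introduced here. The sample rows are the bin 0 rows from the table above, and the output column names follow the answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("range-lag-sketch").getOrCreate()
import spark.implicits._

// Sample rows for bin 0, copied from the result table above.
val df = Seq(
  ("2019_12_19", 7, 1, 0, -1.0),
  ("2019_12_19", 8, 1, 0, 1.0),
  ("2019_12_20", 7, 1, 0, 4.0),
  ("2019_12_20", 8, 1, 0, 7.0)
).toDF("date", "hour", "id", "bin", "label")

// Hypothetical helper column: whole hours elapsed since 1970-01-01 00:00.
val withIndex = df.withColumn(
  "hourIndex",
  datediff(to_date($"date", "yyyy_MM_dd"), to_date(lit("1970-01-01"))) * 24 + $"hour"
)

// Frame containing only the row exactly `hoursBack` hours earlier, if such a row exists.
def atOffset(hoursBack: Long) =
  Window.partitionBy("id", "bin").orderBy("hourIndex").rangeBetween(-hoursBack, -hoursBack)

// first(...) returns null on an empty frame; coalesce mimics lag's 0.0 default.
val result = withIndex
  .withColumn("yesterdaySameHour", coalesce(first("label").over(atOffset(24)), lit(0.0)))
  .withColumn("todayPreviousHour", coalesce(first("label").over(atOffset(1)), lit(0.0)))
  .withColumn("yestedayPreviousHour", coalesce(first("label").over(atOffset(25)), lit(0.0)))
  .drop("hourIndex")

result.orderBy("date", "hour", "bin").show(false)
```

Because the frame is defined by value rather than by row count, a missing day or hour yields the default instead of silently picking up an older row. Note that an offset of 25 hours treats hour 0 as following hour 23 of the day before, which may or may not be what you want.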
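Move date into partitionBy.
Thanks, I understand that. My question is whether I can get "yesterday's previous hour" using windows. At the moment I do a self-join and filter on the date/hour difference, but that feels like overkill.
You can combine lag twice to get the previous day's hour - 1. I will update the answer within a day.
@ZeynepAkkalyoncuYilmaz, I have updated my answer. I hope this solves your problem.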
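For comparison, the self-join mentioned in the comments might look roughly like the sketch below. The asker's actual code is not shown in the thread, so this is an assumption about its shape: the date format (yyyy_MM_dd), the helper column day and the p-prefixed column names are all hypothetical. The sample rows are the bin 0 rows from the result table.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local[*]").appName("self-join-sketch").getOrCreate()
import spark.implicits._

// Sample rows for bin 0, copied from the result table in the answer.
val df = Seq(
  ("2019_12_19", 7, 1, 0, -1.0),
  ("2019_12_19", 8, 1, 0, 1.0),
  ("2019_12_20", 7, 1, 0, 4.0),
  ("2019_12_20", 8, 1, 0, 7.0)
).toDF("date", "hour", "id", "bin", "label")

// Parse the date string so that "one day earlier" can be computed with date_sub.
val cur = df.withColumn("day", to_date($"date", "yyyy_MM_dd"))

// Right-hand side of the join, with every column renamed to avoid ambiguity.
val prev = cur.select(
  $"id".as("pId"), $"bin".as("pBin"), $"day".as("pDay"), $"hour".as("pHour"),
  $"label".as("yestedayPreviousHour")
)

// Left join keeps every current row; rows without a match get null.
val joined = cur.join(
  prev,
  $"id" === $"pId" && $"bin" === $"pBin" &&
    $"pDay" === date_sub($"day", 1) && $"pHour" === $"hour" - 1,
  "left"
).select("date", "hour", "id", "bin", "label", "yestedayPreviousHour")

joined.orderBy("date", "hour", "bin").show(false)
```

Unlike lag with a default of 0.0, the left join leaves the new column null when there is no row for the previous day's previous hour, which makes missing data distinguishable from a genuine label of 0.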