Apache Spark SQL window over an interval between two specified time boundaries (3 hours ago to 2 hours ago)
What is the proper way of specifying a window interval in Spark SQL, using two predefined boundaries?

I am trying to sum up values from my table over the window of "3 hours ago to 2 hours ago".

When I run this query:
select *, sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 2 hours preceding and current row
) as sum_value
from my_temp_table;
this works. I get the results I expect, i.e. the sums of values that fall into the 2-hour rolling window.
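As an aside, the semantics of that working frame can be sketched outside Spark. A minimal plain-Python illustration (toy data and epoch seconds are my own assumptions, not the actual Spark execution): for each row, sum the values whose timestamp falls within the two hours up to and including that row's timestamp.

```python
# Toy illustration (plain Python, not Spark): for each row, sum the values
# whose timestamp lies in [t - 2h, t], which is what
# RANGE BETWEEN INTERVAL 2 HOURS PRECEDING AND CURRENT ROW computes per row.
TWO_HOURS = 2 * 3600

def rolling_sums(rows):
    """rows: (epoch_seconds, value) pairs within one (a, b) partition."""
    return [
        sum(v for u, v in rows if t - TWO_HOURS <= u <= t)
        for t, _ in rows
    ]

if __name__ == "__main__":
    rows = [(0, 1.0), (3600, 2.0), (7200, 4.0), (10800, 8.0)]
    print(rolling_sums(rows))  # [1.0, 3.0, 7.0, 14.0]
```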
Now, what I need is for the rolling window not to be bound to the current row, but to take into account rows between 3 hours ago and 2 hours ago.
I tried:
select *, sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 3 hours preceding and 2 hours preceding
) as sum_value
from my_temp_table;
but got an

extraneous input 'hours' expecting {'PRECEDING', 'FOLLOWING'}

error.
I also tried:
select *, sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 3 hours preceding and interval 2 hours preceding
) as sum_value
from my_temp_table;
but then got a different error: scala.MatchError: CalendarIntervalType (of class org.apache.spark.sql.types.CalendarIntervalType$)
The third option I tried was:
select *, sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 3 hours preceding and 2 preceding
) as sum_value
from my_temp_table;
and it doesn't work as we would expect either: cannot resolve 'RANGE BETWEEN interval 3 hours PRECEDING AND 2 PRECEDING' due to data type mismatch
I've had difficulty finding documentation for the interval type: what there is doesn't say enough, and the other information I found is somewhat incomplete. Since the range intervals didn't work, I had to turn to an alternative approach. It comes down to this:
- prepare a list of intervals over which the computation needs to run
- for each interval, run the computation
- each iteration produces a data frame
- after the iterations, we have a list of data frames
- union the data frames in the list into one bigger data frame
- write out the results
val hourlyDFs = for ((hourStart, hourEnd) <- hoursToStart.zip(hoursToEnd)) yield {
  val hourlyData = data.where($"hour" >= lit(hourStart) && $"hour" <= lit(hourEnd))
  // do stuff
  // return a data frame
  hourlyData
}
hourlyDFs.reduce(_.union(_))
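The loop-and-union shape above can be sketched in plain Python, with a list of pairs standing in for the data frame. The window bounds and the "sum" aggregate in the "do stuff" step are illustrative assumptions, not the actual computation:

```python
# Plain-Python sketch of the interval-list approach: compute the aggregate for
# each (start, end) hour window, then concatenate the per-window results
# (the analogue of reduce(_.union(_)) over data frames).
def sum_per_window(rows, windows):
    """rows: (hour, value) pairs; windows: inclusive (start, end) hour bounds."""
    results = []
    for start, end in windows:
        window_rows = [v for h, v in rows if start <= h <= end]
        results.append((start, end, sum(window_rows)))  # the "do stuff" step
    return results

if __name__ == "__main__":
    rows = [(0, 1.0), (1, 2.0), (2, 4.0), (3, 8.0)]
    print(sum_per_window(rows, [(0, 1), (1, 2), (2, 3)]))
    # [(0, 1, 3.0), (1, 2, 6.0), (2, 3, 12.0)]
```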
I had the same problem and found a simple solution. Here it is:
unix_timestamp(datestamp) - unix_timestamp(datestamp) < 10800 --3 hours in seconds
You can also use timestamps for readability (consider whether you need it):
select unix_timestamp(date_format(current_timestamp, 'HH:mm:ss'), 'HH:mm:ss') <
unix_timestamp('03:00:00', 'HH:mm:ss') -- timestamps for readability
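One caveat with the comparison above: a `< 10800` check alone only expresses "within the last 3 hours". To get the "3 hours ago to 2 hours ago" band the question asks about, the difference needs a lower bound as well. A small plain-Python illustration (epoch-second values are my own toy assumptions):

```python
# "diff < 10800" keeps everything from the last 3 hours; adding a lower bound
# selects only the 3-hours-ago-to-2-hours-ago band.
THREE_HOURS, TWO_HOURS = 3 * 3600, 2 * 3600

def in_band(now_s, event_s):
    """True when event_s lies between 3 and 2 hours before now_s."""
    diff = now_s - event_s
    return TWO_HOURS < diff <= THREE_HOURS

if __name__ == "__main__":
    now = 100_000
    print(in_band(now, now - 9000))  # 2.5 hours ago -> True
    print(in_band(now, now - 3600))  # 1 hour ago -> False
```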
A workaround to get the same result is to compute the sum of values over the last 3 hours, and then subtract the sum of values over the last 2 hours:
select *,
sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 3 hours preceding and current row)
-
sum(value) over (
partition by a, b
order by cast(time_value as timestamp)
range between interval 2 hours preceding and current row)
as sum_value
from my_temp_table;
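A quick numeric sanity check of this subtraction trick (plain Python, toy data of my own): the 3-hour sum minus the 2-hour sum equals the sum over the rows older than 2 hours but within 3. Note that a row exactly 2 hours old appears in both windows and cancels out, so the resulting band is effectively half-open.

```python
# Toy check: sum over the last 3 hours minus sum over the last 2 hours equals
# the sum over the band of rows older than 2h but within 3h.
THREE_HOURS, TWO_HOURS = 3 * 3600, 2 * 3600

def window_sum(rows, t, horizon):
    """Sum of values with timestamps in [t - horizon, t]."""
    return sum(v for u, v in rows if t - horizon <= u <= t)

def band_sum(rows, t):
    """Sum of values with timestamps in [t - 3h, t - 2h)."""
    return sum(v for u, v in rows if t - THREE_HOURS <= u < t - TWO_HOURS)

if __name__ == "__main__":
    rows = [(0, 1.0), (1800, 2.0), (4000, 4.0), (8000, 8.0), (10000, 16.0)]
    t = 10000
    diff = window_sum(rows, t, THREE_HOURS) - window_sum(rows, t, TWO_HOURS)
    print(diff, band_sum(rows, t))  # 3.0 3.0
```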
At the moment, AFAIK, range intervals don't work properly in Spark SQL; only row-count-based frames are robust. See this JIRA ticket. The range interval was also marked as deprecated in the Scala API I looked at. OK, so I'm off to find another approach then.