How do I group records within a specific time interval using Spark Scala or SQL?

I want to group records in Scala if they have the same ID and their times are within one minute of each other. Conceptually I was thinking of something like this, but I'm not quite sure:
HAVING a.ID = b.ID AND a.time + 30 sec > b.time AND a.time - 30 sec < b.time
| ID | volume | Time |
|:-----------|------------:|:--------------------------:|
| 1 | 10 | 2019-02-17T12:00:34Z |
| 2 | 20 | 2019-02-17T11:10:46Z |
| 3 | 30 | 2019-02-17T13:23:34Z |
| 1 | 40 | 2019-02-17T12:01:02Z |
| 2 | 50 | 2019-02-17T11:10:30Z |
| 1 | 60 | 2019-02-17T12:01:57Z |
The above is one solution, but it always rounds to fixed minute boundaries.
For example, 2019-02-17T12:00:45Z gets the range
2019-02-17T12:00:00Z to 2019-02-17T12:01:00Z.
What I'm looking for is this:
2019-02-17T11:45:00Z to 2019-02-17T12:01:45Z.
Is there a way to do that?
org.apache.spark.sql.functions provides overloaded window functions, as below.
1. window(timeColumn: Column, windowDuration: String): Generates tumbling time windows given a timestamp-specifying column. Window starts are inclusive but window ends are exclusive, e.g. 12:05 will be in the window [12:05, 12:10) but not in [12:00, 12:05). The windows will look like:
{{{
09:00:00-09:01:00
09:01:00-09:02:00
09:02:00-09:03:00 ...
}}}
2. window(timeColumn: Column, windowDuration: String, slideDuration: String): Bucketizes rows into one or more time windows given a timestamp-specifying column. Window starts are inclusive but window ends are exclusive, e.g. 12:05 will be in the window [12:05, 12:10) but not in [12:00, 12:05). slideDuration specifies the sliding interval of the window, e.g. 1 minute: a new window is generated every slideDuration, which must be less than or equal to windowDuration. With a 10-second slide, the windows will look like:
{{{
09:00:00-09:01:00
09:00:10-09:01:10
09:00:20-09:01:20 ...
}}}
3. window(timeColumn: Column, windowDuration: String, slideDuration: String, startTime: String): Bucketizes rows into one or more time windows given a timestamp-specifying column. Window starts are inclusive but window ends are exclusive, e.g. 12:05 will be in the window [12:05, 12:10) but not in [12:00, 12:05). startTime is the offset with which to shift the window boundaries; with a 5-second offset the windows will look like:
{{{
09:00:05-09:01:05
09:00:15-09:01:15
09:00:25-09:01:25 ...
}}}
For example, in order to have hourly tumbling windows that start 15 minutes past the hour, e.g. 12:15-13:15, 13:15-14:15..., provide startTime as "15 minutes". This third overloaded window function fits your requirement.
Please find the working code below.
import org.apache.spark.sql.SparkSession

object SparkWindowTest extends App {

  val spark = SparkSession
    .builder()
    .master("local")
    .appName("File_Streaming")
    .getOrCreate()

  import spark.implicits._
  import org.apache.spark.sql.functions._

  // Prepare test data
  val df = Seq((1, 10, "2019-02-17 12:00:49"), (2, 20, "2019-02-17 11:10:46"),
    (3, 30, "2019-02-17 13:23:34"), (2, 50, "2019-02-17 11:10:30"),
    (1, 40, "2019-02-17 12:01:02"), (1, 60, "2019-02-17 12:01:57"))
    .toDF("ID", "Volume", "TimeString")

  df.show()
  df.printSchema()

+---+------+-------------------+
| ID|Volume| TimeString|
+---+------+-------------------+
| 1| 10|2019-02-17 12:00:49|
| 2| 20|2019-02-17 11:10:46|
| 3| 30|2019-02-17 13:23:34|
| 2| 50|2019-02-17 11:10:30|
| 1| 40|2019-02-17 12:01:02|
| 1| 60|2019-02-17 12:01:57|
+---+------+-------------------+

root
 |-- ID: integer (nullable = false)
 |-- Volume: integer (nullable = false)
 |-- TimeString: string (nullable = true)

  // Convert the string timestamp into a proper timestamp
  val modifiedDF = df.withColumn("Time", to_timestamp($"TimeString"))

  // Drop the string timestamp from the DF
  val modifiedDF1 = modifiedDF.drop("TimeString")

  modifiedDF.show(false)
  modifiedDF.printSchema()

+---+------+-------------------+-------------------+
|ID |Volume|TimeString |Time |
+---+------+-------------------+-------------------+
|1 |10 |2019-02-17 12:00:49|2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|2019-02-17 12:01:57|
+---+------+-------------------+-------------------+

root
 |-- ID: integer (nullable = false)
 |-- Volume: integer (nullable = false)
 |-- TimeString: string (nullable = true)
 |-- Time: timestamp (nullable = true)

  modifiedDF1.show(false)
  modifiedDF1.printSchema()

+---+------+-------------------+
|ID |Volume|Time |
+---+------+-------------------+
|1 |10 |2019-02-17 12:00:49|
|2 |20 |2019-02-17 11:10:46|
|3 |30 |2019-02-17 13:23:34|
|2 |50 |2019-02-17 11:10:30|
|1 |40 |2019-02-17 12:01:02|
|1 |60 |2019-02-17 12:01:57|
+---+------+-------------------+

root
 |-- ID: integer (nullable = false)
 |-- Volume: integer (nullable = false)
 |-- Time: timestamp (nullable = true)

  // Main logic: 1-minute windows shifted to start at 45 seconds past the minute
  val modifiedDF2 = modifiedDF1
    .groupBy($"ID", window($"Time", "1 minutes", "1 minutes", "45 seconds"))
    .sum("Volume")

  // Rename all columns of the DF
  val newNames = Seq("ID", "WINDOW", "VOLUME")
  val finalDF = modifiedDF2.toDF(newNames: _*)

  finalDF.show(false)

+---+---------------------------------------------+------+
|ID |WINDOW |VOLUME|
+---+---------------------------------------------+------+
|2 |[2019-02-17 11:09:45.0,2019-02-17 11:10:45.0]|50 |
|1 |[2019-02-17 12:01:45.0,2019-02-17 12:02:45.0]|60 |
|1 |[2019-02-17 12:00:45.0,2019-02-17 12:01:45.0]|50 |
|3 |[2019-02-17 13:22:45.0,2019-02-17 13:23:45.0]|30 |
|2 |[2019-02-17 11:10:45.0,2019-02-17 11:11:45.0]|20 |
+---+---------------------------------------------+------+
}
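The 45-second startTime in the groupBy above is what shifts every window boundary from :00 to :45. The boundary arithmetic is plain modular math and can be sketched outside Spark; this is an illustration of the documented semantics for non-negative inputs, not Spark's internal code, and the object and method names are made up for the example (times are seconds since midnight):

```scala
object WindowStartSketch {
  // For a timestamp ts, compute the start of the tumbling window that
  // contains it, given the slide interval and the startTime offset
  // (all in seconds). Mirrors the documented window() semantics.
  def windowStart(ts: Long, slideSeconds: Long, offsetSeconds: Long): Long =
    ts - ((ts - offsetSeconds) % slideSeconds)

  def main(args: Array[String]): Unit = {
    // 12:00:49 with a 1-minute slide and a 45-second offset
    val ts = 12 * 3600 + 49  // 43249 seconds since midnight
    println(windowStart(ts, 60, 45))  // 43245, i.e. 12:00:45
  }
}
```

Both 12:00:49 and 12:01:02 land on the same start (12:00:45), while 12:01:57 lands on 12:01:45, matching the WINDOW column in the output above.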
12:00:34 and 12:01:02 are less than a minute apart. But 12:01:02 and 12:01:57 are also less than a minute apart. Why don't you want to combine all three? Why do you prefer combining the first two rather than the last two? Also, should your final "2019-02-17T11:45:00Z to 2019-02-17T12:01:45Z" read "2019-02-17T12:00:45Z to 2019-02-17T12:01:45Z"?

12:00:34 and 12:01:02 are within 1 minute. But 12:00:34 and 12:01:57 are not, and I don't want to merge those; they span almost 2 minutes. I hope that clarifies. 2019-02-17T11:45:00Z (+1m) (+1m) 2019-02-17T12:01:45Z.

No, you've ignored half my question. Why don't you merge 12:01:02 and 12:01:57? They are 55 seconds apart. Is it because 12:01:02 has already been merged with 12:00:34? In that case, if there were a row at 12:00:01, things would change: you would combine 12:00:01 with 12:00:34, and then separately combine 12:01:02 with 12:01:57. That means you cannot tell which rows to combine without going back to the start of the sequence and rolling forward. That is a sequential loop, which is not how things are done in SQL.

You are right! Thanks again for sticking with my question.

Your first test data record is incorrect. It should be 12:00:34, not 12:00:49, and that would change the results. The OP doesn't want regular 1-minute intervals starting at 45 seconds; the OP wants dynamic windows based on the data. So for ID = 1 that would be 12:00:34 -> 12:01:34 and then 12:01:57 -> 12:02:57, but not 12:01:02 -> 12:02:02, because that starting record is already contained in the first window. Achieving that is much more laborious (you have to go to the very beginning of the sequence and iterate forward, which no flavour of SQL is good at).

Great answer, thanks for all the information! It would be great if there were a way to make startTime dynamic instead of marking off every 45 seconds: take the record's time and just add 30 seconds either way.

startTime can be dynamic. You can pass it as a parameter instead of hard-coding 45, e.g. $"time".

Like this? Dynamic with respect to the record's time, not a fixed clock time: if the record is at 12:01:20, the range would be 12:00:20 to 12:02:20.
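The dynamic, data-driven grouping discussed in these comments (a new group opens whenever a record falls more than a minute after the current group's first record) does need a sequential forward pass, but outside SQL that pass is a simple fold. A hypothetical plain-Scala sketch, with timestamps as seconds of the day; DynamicWindowSketch and groupWithin are illustrative names, not part of any Spark API:

```scala
object DynamicWindowSketch {
  // Group time-sorted (timestampSeconds, volume) rows: a row joins the
  // current group if it is within gapSeconds of that group's FIRST
  // record; otherwise it opens a new group. This is the sequential
  // scan described in the comments, not a fixed window() call.
  def groupWithin(rows: Seq[(Long, Int)], gapSeconds: Long = 60L): List[List[(Long, Int)]] =
    rows.foldLeft(List.empty[List[(Long, Int)]]) {
      case (Nil, row) => List(List(row))
      case (current :: done, row) =>
        if (row._1 - current.head._1 < gapSeconds) (current :+ row) :: done
        else List(row) :: current :: done
    }.reverse

  def main(args: Array[String]): Unit = {
    // ID = 1 rows from the comments: 12:00:34, 12:01:02, 12:01:57
    val rows = Seq((43234L, 10), (43262L, 40), (43317L, 60))
    val sums = groupWithin(rows).map(g => (g.head._1, g.map(_._2).sum))
    println(sums)  // List((43234,50), (43317,60))
  }
}
```

12:01:02 joins the group opened at 12:00:34 (28 s after its first record), while 12:01:57 (83 s after) opens a new group, exactly as the comments describe. Inside Spark, one could in principle run the same fold per ID via groupByKey with mapGroups over time-sorted rows, but the point here is only that this grouping cannot be expressed as a static window() call.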