Apache Flink: late events do not seem to be dropped when doing an interval join between two streams
I am using Flink 1.11, and I have the following test case to try out an event-time interval join. The data for the two streams is defined as follows:
object JoinStockInterval {
  // The stocks data.
  // ts is the implicit method that converts the time string to a timestamp.
  val stocks = Seq(
    Stock("id1", "2020-09-16 20:50:15".ts, 1),
    Stock("id1", "2020-09-16 20:50:12".ts, 2),
    Stock("id1", "2020-09-16 20:50:18".ts, 4),
    Stock("id1", "2020-09-16 20:50:11".ts, 3),
    Stock("id1", "2020-09-16 20:50:11".ts, 10),
    Stock("id1", "2020-09-16 20:50:13".ts, 5),
    Stock("id1", "2020-09-16 20:50:20".ts, 6),
    Stock("id1", "2020-09-16 20:50:14".ts, 7),
    Stock("id1", "2020-09-16 20:50:22".ts, 8),
    Stock("id1", "2020-09-16 20:50:40".ts, 9),
    Stock("id1", "2020-09-16 20:50:15".ts, 100)
  )

  // Mock that the stock name is changing over time.
  val stockNameChangings = Seq(
    StockNameChanging("id1", "Stock1", "2020-09-16 20:50:16".ts),
    StockNameChanging("id1", "Stock101", "2020-09-16 20:50:20".ts),
    StockNameChanging("id1", "Stock4", "2020-09-16 20:50:17".ts),
    StockNameChanging("id1", "Stock7", "2020-09-16 20:50:21".ts),
    StockNameChanging("id1", "Stock5", "2020-09-16 20:50:17".ts),
    StockNameChanging("id1", "Stock501", "2020-09-16 20:50:22".ts),
    StockNameChanging("id1", "Stock6", "2020-09-16 20:50:23".ts)
  )
}
The test case is defined as follows; each stream allows 4 seconds of lateness:
test("test interval join inner 2 works") {
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setParallelism(1)

  // Both sources assign timestamps and watermarks, allowing 4 seconds of lateness.
  val ds1 = env
    .addSource(new IntervalJoinStockSource(emitInterval = 0))
    .assignTimestampsAndWatermarks(new StockWatermarkGenerator(4000))
  val ds2 = env
    .addSource(new IntervalJoinStockNameChangingSource(emitInterval = 0))
    .assignTimestampsAndWatermarks(new StockNameChangingWatermarkGenerator(4000))

  val tenv = StreamTableEnvironment.create(env)
  tenv.createTemporaryView("s1", ds1, $"id", $"price", $"trade_date".rowtime() as "rt1")
  tenv.createTemporaryView("s2", ds2, $"id", $"name", $"trade_date".rowtime() as "rt2")
  tenv.from("s1").printSchema()
  tenv.from("s2").printSchema()

  val sql =
    """
    select s1.id, s2.name, s1.price, cast (s1.rt1 as timestamp) as rt1, s2.rt2
    from s1 join s2
    on s1.id = s2.id
    where s1.rt1 between s2.rt2 - interval '2' second and s2.rt2 + interval '2' second
    """.stripMargin(' ')

  tenv.sqlQuery(sql).toAppendStream[Row].print()
  env.execute()
}
The join result is as follows:
id1,Stock1,1.0,2020-09-16T12:50:15,2020-09-16T12:50:16
id1,Stock1,4.0,2020-09-16T12:50:18,2020-09-16T12:50:16
id1,Stock101,4.0,2020-09-16T12:50:18,2020-09-16T12:50:20
id1,Stock4,4.0,2020-09-16T12:50:18,2020-09-16T12:50:17
id1,Stock4,1.0,2020-09-16T12:50:15,2020-09-16T12:50:17
id1,Stock5,4.0,2020-09-16T12:50:18,2020-09-16T12:50:17
id1,Stock5,1.0,2020-09-16T12:50:15,2020-09-16T12:50:17
id1,Stock101,6.0,2020-09-16T12:50:20,2020-09-16T12:50:20
id1,Stock7,6.0,2020-09-16T12:50:20,2020-09-16T12:50:21
id1,Stock501,6.0,2020-09-16T12:50:20,2020-09-16T12:50:22
id1,Stock1,7.0,2020-09-16T12:50:14,2020-09-16T12:50:16
id1,Stock101,8.0,2020-09-16T12:50:22,2020-09-16T12:50:20
id1,Stock501,8.0,2020-09-16T12:50:22,2020-09-16T12:50:22
id1,Stock7,8.0,2020-09-16T12:50:22,2020-09-16T12:50:21
id1,Stock6,8.0,2020-09-16T12:50:22,2020-09-16T12:50:23
id1,Stock1,100.0,2020-09-16T12:50:15,2020-09-16T12:50:16
id1,Stock4,100.0,2020-09-16T12:50:15,2020-09-16T12:50:17
id1,Stock5,100.0,2020-09-16T12:50:15,2020-09-16T12:50:17
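The strange thing is that the last rows of the result above (price 100.0) come from Stock("id1", "2020-09-16 20:50:15".ts, 100) in the stocks stream, even though this record arrives late in that stream.
See the last two records in the stocks stream:
Stock("id1", "2020-09-16 20:50:40".ts, 9),
Stock("id1", "2020-09-16 20:50:15".ts, 100)
I have been stuck on this problem for several days. Why was this record not dropped, and why was it instead joined successfully with the other stream (the name-changing stream)?
The watermark strategy uses an assigner with punctuated watermarks.
The record you are wondering about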
Stock("id1", "2020-09-16 20:50:15".ts, 100)
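is not late, from the perspective of the join.
The reason has to do with how watermarks are propagated when an operator has multiple inputs (such as this interval join). The current watermark at the join operator is always the minimum of the watermarks it has received from all of its input channels.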
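The minimum rule can be sketched in a few lines of plain Scala. This is only a hypothetical model for illustration (the class name `TwoInputWatermark` and its methods are made up here and are not Flink's internal code): it remembers the latest watermark per input and reports the smaller of the two as the operator's watermark.

```scala
// Hypothetical model of a two-input operator's watermark, NOT Flink's implementation.
// Each input keeps its own latest watermark; the operator sees the minimum of both.
final class TwoInputWatermark {
  private var left: Long = Long.MinValue  // nothing received yet on the left input
  private var right: Long = Long.MinValue // nothing received yet on the right input

  // The operator-level watermark is held back by the slower input.
  def current: Long = math.min(left, right)

  // Watermarks never move backwards on an input, hence the max.
  def onLeft(watermark: Long): Long = { left = math.max(left, watermark); current }
  def onRight(watermark: Long): Long = { right = math.max(right, watermark); current }
}
```

In this model, a left input that has raced ahead does not advance `current` at all until the right input catches up, which is the effect described above.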
So until the join has processed this record
StockNameChanging("id1", "Stock501", "2020-09-16 20:50:22".ts)
the watermark at the join is determined by this record
StockNameChanging("id1", "Stock7", "2020-09-16 20:50:21".ts)
and so the watermark is still within the interval defined for the join.
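To put numbers on this, here is a rough walkthrough in plain Scala. It rests on several assumptions that the question does not confirm: that both generators emit a punctuated watermark equal to the highest timestamp seen so far minus 4000 ms, that the stocks input has already reached the record at 20:50:40, and that "within the interval" can be modelled as the watermark lying between rt1 - 2 s and rt1 + 2 s. The names `WatermarkWalkthrough`, `watermarksAfterEach` and `withinJoinInterval` are invented for this sketch, and the check is a simplification, not Flink's actual interval-join logic.

```scala
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.DateTimeFormatter

object WatermarkWalkthrough {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Parse a time string to epoch millis (UTC is an arbitrary choice for this sketch).
  def ts(s: String): Long =
    LocalDateTime.parse(s, fmt).toInstant(ZoneOffset.UTC).toEpochMilli

  // ASSUMED generator behaviour: after each event, watermark = max timestamp so far - delay.
  def watermarksAfterEach(timestamps: Seq[Long], delayMs: Long): Seq[Long] =
    timestamps.scanLeft(Long.MinValue)((maxTs, t) => math.max(maxTs, t)).tail.map(_ - delayMs)

  // Simplified model of "the watermark is still within the join interval" for one row.
  def withinJoinInterval(rowTs: Long, watermark: Long, boundMs: Long): Boolean =
    watermark >= rowTs - boundMs && watermark <= rowTs + boundMs

  val delayMs = 4000L

  // Timestamps of the name-changing stream, in emission order from the question.
  val nameChangeTimes: Seq[Long] = Seq(
    "2020-09-16 20:50:16", "2020-09-16 20:50:20", "2020-09-16 20:50:17",
    "2020-09-16 20:50:21", "2020-09-16 20:50:17", "2020-09-16 20:50:22",
    "2020-09-16 20:50:23"
  ).map(ts)

  val nameWatermarks: Seq[Long] = watermarksAfterEach(nameChangeTimes, delayMs)

  // Stocks input after Stock(..., 20:50:40, 9): its own watermark is already 20:50:36.
  val stockWatermark: Long = ts("2020-09-16 20:50:40") - delayMs

  // The record in question: Stock("id1", 20:50:15, 100).
  val lateStock: Long = ts("2020-09-16 20:50:15")

  // Join watermark = minimum of both inputs.
  val joinWmAfterStock7: Long = math.min(stockWatermark, nameWatermarks(3))   // Stock7 is index 3
  val joinWmAfterStock501: Long = math.min(stockWatermark, nameWatermarks(5)) // Stock501 is index 5
}
```

Under these assumptions the join watermark after Stock7 is 20:50:17, the upper edge of the 20:50:13 to 20:50:17 window around the 20:50:15 stock record, and it moves to 20:50:18 only once Stock501 has been processed.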
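Watermarks work this way because they are an assertion that the stream can now be considered complete, up to the timestamp of the watermark. From the perspective of the join, it only knows this up to the watermark of the stream that is furthest behind.
Thanks @David for the great contribution! I think I understand it now,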