Apache Spark Structured Streaming: watermark and dropDuplicates
Tags: apache-spark, spark-structured-streaming

I am trying to drop duplicates with a watermark. The problem is that the watermark does not clear the state. My code is:
import org.apache.log4j.LogManager
import org.apache.spark.sql.{DataFrame, SparkSession}

def main(args: Array[String]): Unit = {
  @transient lazy val log = LogManager.getRootLogger
  val spark = SparkSession
    .builder
    .master("local[2]")
    .appName("RateResource")
    .getOrCreate()
  import spark.implicits._

  val rateData: DataFrame = spark.readStream.format("rate").load()
  val transData = rateData
    .select($"timestamp" as "wtimestamp", $"value",
      $"value" % 1000 % 100 % 10 as "key",
      $"value" % 1000 % 100 / 10 % 2 as "dkey")
    .where("key = 0")
  val selectData = transData
    .withWatermark("wtimestamp", "20 seconds")
    .dropDuplicates("dkey", "wtimestamp")

  val query = selectData.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start()
  query.awaitTermination()
}
And the input records are:
2017-08-09 10:00:10,10
2017-08-09 10:00:20,20
2017-08-09 10:00:30,10
2017-08-09 10:00:10,10
2017-08-09 11:00:30,40
2017-08-09 10:00:10,10
The first "2017-08-09 10:00:10,10" is output, but the second "2017-08-09 10:00:10,10", sent more than 10 seconds later, is never output:
-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:10|10 |0.0|1.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 2
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 3
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:20|20 |0.0|0.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 4
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 5
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:30|10 |0.0|1.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 6
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 7
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 8
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 11:00:30|40 |0.0|0.0 |
+-------------------+-----+---+----+
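For reference, assuming the watermark is simply the maximum event time seen so far minus the 20-second delay (which is how Structured Streaming advances it between batches), its progression over the six input records can be traced in plain Scala, with no Spark involved:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object WatermarkTrace {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // For each record: (event time, watermark after seeing it, whether the
  // record arrived below the watermark computed from earlier records).
  def trace(times: Seq[String], delaySeconds: Long): Seq[(String, String, Boolean)] = {
    var maxEvent: Option[LocalDateTime] = None
    times.map { s =>
      val t = LocalDateTime.parse(s, fmt)
      // a record is "late" if it falls strictly below the current watermark
      val late = maxEvent.exists(m => t.isBefore(m.minusSeconds(delaySeconds)))
      maxEvent = Some(maxEvent.fold(t)(m => if (t.isAfter(m)) t else m))
      val watermark = maxEvent.get.minusSeconds(delaySeconds)
      (s, watermark.format(fmt), late)
    }
  }

  def main(args: Array[String]): Unit =
    trace(Seq(
      "2017-08-09 10:00:10",
      "2017-08-09 10:00:20",
      "2017-08-09 10:00:30",
      "2017-08-09 10:00:10", // duplicate, still at/above the watermark
      "2017-08-09 11:00:30", // jumps the watermark forward to 11:00:10
      "2017-08-09 10:00:10"  // now far below the watermark
    ), 20L).foreach { case (e, w, late) =>
      println(s"event=$e  watermark=$w  lateArrival=$late")
    }
}
```

This shows why only the last "10:00:10" record is below the watermark: after "11:00:30" arrives, the watermark jumps to 11:00:10, so the final duplicate is a late arrival.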
I know that with windowed aggregations the watermark clears state based on the max event time, but with dropDuplicates I do not understand how the state gets cleared.

The dropDuplicates operator clears its state through the watermark. In your code, the watermark declared before dropDuplicates has a 20-second delay, so Spark keeps all deduplication state within 20 seconds of the current max event time. That means incoming data is only compared against the last 20 seconds of data, and older state is evicted.
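That eviction behavior can be sketched in plain Scala. This is a hypothetical model, not Spark's actual state store: keep one state entry per deduplication key, advance the watermark to max event time minus the delay, evict state entries older than the watermark, and drop incoming rows that either fall below it or match a key still in state:

```scala
object DedupStateSketch {
  // records: (dedup key, event time in epoch seconds); delaySeconds: the
  // watermark delay. Returns the rows that would be emitted downstream.
  def run(records: Seq[(String, Long)], delaySeconds: Long): Seq[(String, Long)] = {
    var state = Map.empty[String, Long]  // key -> event time when first seen
    var maxEventTime = Long.MinValue
    val output = Seq.newBuilder[(String, Long)]
    for ((key, ts) <- records) {
      maxEventTime = math.max(maxEventTime, ts)
      val watermark = maxEventTime - delaySeconds
      // state entries older than the watermark are evicted
      state = state.filter { case (_, t) => t >= watermark }
      if (ts >= watermark && !state.contains(key)) {
        state += (key -> ts)
        output += ((key, ts))
      }
      // rows below the watermark, or duplicates still in state, are dropped
    }
    output.result()
  }
}
```

Under this model, once "11:00:30" pushes the watermark past 10:00:10, the state for that timestamp is evicted, and the late third copy of "10:00:10" is discarded as too late rather than emitted again, which matches the empty batches you observed.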