Apache Spark Structured Streaming: watermark and dropDuplicates
Tags: apache-spark, spark-structured-streaming

I am trying to drop duplicates with a watermark. The problem is that the watermark does not clear the state. My code is:
import org.apache.log4j.LogManager
import org.apache.spark.sql.{DataFrame, SparkSession}

def main(args: Array[String]): Unit = {
  @transient lazy val log = LogManager.getRootLogger
  val spark = SparkSession
    .builder
    .master("local[2]")
    .appName("RateResource")
    .getOrCreate()
  import spark.implicits._

  val rateData: DataFrame = spark.readStream.format("rate").load()
  val transData = rateData
    .select($"timestamp" as "wtimestamp", $"value",
      $"value" % 1000 % 100 % 10 as "key",
      $"value" % 1000 % 100 / 10 % 2 as "dkey")
    .where("key = 0")
  val selectData = transData
    .withWatermark("wtimestamp", "20 seconds")
    .dropDuplicates("dkey", "wtimestamp")

  val query = selectData.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", "false")
    .start()
  query.awaitTermination()
}
And the input records are:
2017-08-09 10:00:10,10
2017-08-09 10:00:20,20
2017-08-09 10:00:30,10
2017-08-09 10:00:10,10
2017-08-09 11:00:30,40
2017-08-09 10:00:10,10
The first "2017-08-09 10:00:10,10" is output, but the second "2017-08-09 10:00:10,10", sent more than 10 seconds later, is never output:
-------------------------------------------
Batch: 1
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:10|10 |0.0|1.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 2
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 3
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:20|20 |0.0|0.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 4
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 5
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 10:00:30|10 |0.0|1.0 |
+-------------------+-----+---+----+
-------------------------------------------
Batch: 6
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 7
-------------------------------------------
+----------+-----+---+----+
|wtimestamp|value|key|dkey|
+----------+-----+---+----+
+----------+-----+---+----+
-------------------------------------------
Batch: 8
-------------------------------------------
+-------------------+-----+---+----+
|wtimestamp |value|key|dkey|
+-------------------+-----+---+----+
|2017-08-09 11:00:30|40 |0.0|0.0 |
+-------------------+-----+---+----+
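For reference, assuming the watermark is simply the maximum event time seen so far minus the 20-second delay (which is how Structured Streaming advances it between batches), its progression over the six input records can be traced in plain Scala, with no Spark involved:

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object WatermarkTrace {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // For each record: (event time, watermark after seeing it, whether the
  // record arrived below the watermark computed from earlier records).
  def trace(times: Seq[String], delaySeconds: Long): Seq[(String, String, Boolean)] = {
    var maxEvent: Option[LocalDateTime] = None
    times.map { s =>
      val t = LocalDateTime.parse(s, fmt)
      // a record is "late" if it falls strictly below the current watermark
      val late = maxEvent.exists(m => t.isBefore(m.minusSeconds(delaySeconds)))
      maxEvent = Some(maxEvent.fold(t)(m => if (t.isAfter(m)) t else m))
      val watermark = maxEvent.get.minusSeconds(delaySeconds)
      (s, watermark.format(fmt), late)
    }
  }

  def main(args: Array[String]): Unit =
    trace(Seq(
      "2017-08-09 10:00:10",
      "2017-08-09 10:00:20",
      "2017-08-09 10:00:30",
      "2017-08-09 10:00:10", // duplicate, still at/above the watermark
      "2017-08-09 11:00:30", // jumps the watermark forward to 11:00:10
      "2017-08-09 10:00:10"  // now far below the watermark
    ), 20L).foreach { case (e, w, late) =>
      println(s"event=$e  watermark=$w  lateArrival=$late")
    }
}
```

This shows why only the last "10:00:10" record is below the watermark: after "11:00:30" arrives, the watermark jumps to 11:00:10, so the final duplicate is a late arrival.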
I know that with windowed aggregations the watermark clears state based on the max event time, but with dropDuplicates I do not understand how the state gets cleared.

The dropDuplicates operator clears its state through the watermark. In your code, the watermark declared before dropDuplicates has a 20-second delay, so Spark keeps all deduplication state within 20 seconds of the current max event time. That means incoming data is only compared against the last 20 seconds of data, and older state is evicted.
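That eviction behavior can be sketched in plain Scala. This is a hypothetical model, not Spark's actual state store: keep one state entry per deduplication key, advance the watermark to max event time minus the delay, evict state entries older than the watermark, and drop incoming rows that either fall below it or match a key still in state:

```scala
object DedupStateSketch {
  // records: (dedup key, event time in epoch seconds); delaySeconds: the
  // watermark delay. Returns the rows that would be emitted downstream.
  def run(records: Seq[(String, Long)], delaySeconds: Long): Seq[(String, Long)] = {
    var state = Map.empty[String, Long]  // key -> event time when first seen
    var maxEventTime = Long.MinValue
    val output = Seq.newBuilder[(String, Long)]
    for ((key, ts) <- records) {
      maxEventTime = math.max(maxEventTime, ts)
      val watermark = maxEventTime - delaySeconds
      // state entries older than the watermark are evicted
      state = state.filter { case (_, t) => t >= watermark }
      if (ts >= watermark && !state.contains(key)) {
        state += (key -> ts)
        output += ((key, ts))
      }
      // rows below the watermark, or duplicates still in state, are dropped
    }
    output.result()
  }
}
```

Under this model, once "11:00:30" pushes the watermark past 10:00:10, the state for that timestamp is evicted, and the late third copy of "10:00:10" is discarded as too late rather than emitted again, which matches the empty batches you observed.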