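Apache Spark: keys in Spark Structured Streaming after 'GroupStateTimeout.ProcessingTimeout()' completes

I am writing Structured Streaming code in which I subscribe to data from a Kafka queue and write the raw data back to HBase. In between, I have to satisfy the following requirements:

1. Data in the stream must be deduplicated within a 2-hour window: whenever data for a new key arrives, the key should be kept in memory for 2 hours, and no duplicate arriving within those 2 hours may be sent to HBase.
2. If a new record arrives for a key that is already in state but its value has changed, the updated record should be sent to HBase, and the key should then be kept in memory for 2 hours from that point.
3. There is no way to tell how late data may arrive; any incoming data must be handled according to the conditions above.

Because of conditions 2 and 3, I cannot use the Spark-provided …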
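To make the three requirements concrete, here is a small Spark-free model of the intended decision for a single key. It is only a sketch: `DedupModel`, `Entry` and `step` are hypothetical names, the clock is passed in explicitly (processing time, since requirement 3 rules out relying on event time), and an entry older than 2 hours is treated as if it were absent.

```scala
object DedupModel {
  val TtlMs: Long = 2L * 60 * 60 * 1000 // 2 hours in milliseconds

  // What is remembered per key: the last value sent and when it was stored.
  final case class Entry(value: String, storedAtMs: Long)

  // Returns the entry to keep for the key and whether the record goes to HBase.
  def step(prev: Option[Entry], value: String, nowMs: Long): (Entry, Boolean) =
    prev match {
      // Requirement 1: same value inside the 2-hour window -> suppress, keep the original expiry.
      case Some(e) if nowMs - e.storedAtMs < TtlMs && e.value == value => (e, false)
      // Requirement 2 (value changed), or no live entry: emit and restart the 2 hours.
      case _ => (Entry(value, nowMs), true)
    }

  def main(args: Array[String]): Unit = {
    val hour = 60L * 60 * 1000
    val (e1, out1) = step(None, "a", 0L)           // new key
    val (e2, out2) = step(Some(e1), "a", 1 * hour) // duplicate inside the window
    val (e3, out3) = step(Some(e2), "b", 1 * hour) // value changed
    val (_, out4)  = step(Some(e3), "b", 4 * hour) // entry older than 2 hours
    println(Seq(out1, out2, out3, out4).mkString(",")) // true,false,true,true
  }
}
```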
    val kafkaIpStream = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", kafkaBroker)
      .option("subscribe", topic)
      .option("startingOffsets", "earliest")
      .load()
Code to remove the duplicates:
    val kafkaStream = kafkaIpStream.selectExpr("cast (key as String)", "cast (value as String)")
      .withColumn("ts", split($"key", "/")(1))
      .selectExpr("key as rowkey", "ts", "value as val")
      .withColumn("isValid", validationUDF($"rowkey", $"ts", $"val"))
      .as[inputTsRecord]
      .groupByKey(_.rowkey)
      .flatMapGroupsWithState(OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout())(updateStateAccrossRecords)
      .toDF("rowkey", "ts", "val", "isValid")
Deduplication function:
    case class inputTsRecord(rowkey: String, ts: String, `val`: String, isValid: String)
    case class state(rowkey: String, `val`: String, insertTimestamp: Long)

    def updateStateAccrossRecords(rowKey: String, inputRows: Iterator[inputTsRecord], oldState: GroupState[state]): Iterator[inputTsRecord] = {
      inputRows.toSeq.toIterator.flatMap { iprow =>
        println("received data for " + iprow.rowkey)
        if (oldState.hasTimedOut) {
          println("State timed out")
          oldState.remove()
          Iterator()
        }
        else if (oldState.exists) {
          println("State exists for " + iprow.rowkey)
          // Whole hours since the state was last written (integer division on seconds).
          val timeDuration = ((((System.currentTimeMillis / 1000) - oldState.get.insertTimestamp) / 60) / 60)
          println("State not timed out for " + iprow.rowkey)
          println("Duration passed " + timeDuration)
          val updatedState = state(iprow.rowkey, iprow.`val`, (System.currentTimeMillis / 1000))
          val isValChanged = if (updatedState.`val` == oldState.get.`val`) false else true
          if (isValChanged) {
            // Changed value: store it, restart the 2-hour timeout, and emit the record.
            println("value changed for " + iprow.rowkey)
            oldState.update(updatedState)
            oldState.setTimeoutDuration("2 hours")
            Iterator(iprow)
          } else {
            // Duplicate: emit nothing; drop the state manually once 2 whole hours have passed.
            if (timeDuration >= 2) {
              println("removing state for " + iprow.rowkey)
              oldState.remove()
            }
            println("value not changed for " + iprow.rowkey)
            Iterator()
          }
        } else {
          // First record for this key: store it, start the 2-hour timeout, and emit the record.
          println("State does not exists for " + iprow.rowkey)
          val newState = state(iprow.rowkey, iprow.`val`, (System.currentTimeMillis / 1000))
          oldState.update(newState)
          oldState.setTimeoutDuration("2 hours")
          Iterator(iprow)
        }
      }
    }
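Two things stand out in the function above. First, `oldState.hasTimedOut` is tested inside the loop over `inputRows`, but when Spark invokes the function because a key timed out it passes an empty iterator, so that branch is never reached. Second, as far as I understand the `GroupState` documentation, the timeout is reset on every invocation and has to be set again each time; a duplicate record leaves this function without calling `setTimeoutDuration`. Below is a minimal sketch of a restructured function under those assumptions. The names `DedupState`, `TtlMs` and `remainingTimeoutMs`, and the corrected spelling `updateStateAcrossRecords`, are mine and not from the question.

```scala
import org.apache.spark.sql.streaming.GroupState

object DedupState {
  // Same shapes as the case classes in the question.
  case class inputTsRecord(rowkey: String, ts: String, `val`: String, isValid: String)
  case class state(rowkey: String, `val`: String, insertTimestamp: Long)

  val TtlMs: Long = 2L * 60 * 60 * 1000 // 2 hours in milliseconds

  // Milliseconds left of the 2-hour window that started at insertTimestampSec (seconds).
  def remainingTimeoutMs(insertTimestampSec: Long, nowMs: Long, ttlMs: Long = TtlMs): Long =
    ttlMs - (nowMs - insertTimestampSec * 1000L)

  def updateStateAcrossRecords(
      rowKey: String,
      inputRows: Iterator[inputTsRecord],
      oldState: GroupState[state]): Iterator[inputTsRecord] = {
    if (oldState.hasTimedOut) {
      // Timeout invocation: there are no rows, so this check must sit outside any per-row loop.
      oldState.remove()
      Iterator.empty
    } else {
      val nowMs = System.currentTimeMillis
      // Keep only rows that are new or carry a changed value, updating the state as we go.
      val emitted = inputRows.toList.filter { row =>
        val changed = !oldState.exists || oldState.get.`val` != row.`val`
        if (changed) oldState.update(state(row.rowkey, row.`val`, nowMs / 1000))
        changed
      }
      // Re-arm the timeout on every call; duplicates keep the expiry of the last change.
      if (oldState.exists) {
        val remaining = remainingTimeoutMs(oldState.get.insertTimestamp, nowMs)
        if (remaining > 0) oldState.setTimeoutDuration(remaining) else oldState.remove()
      }
      emitted.iterator
    }
  }
}
```

It would be wired in the same way as before, by passing `DedupState.updateStateAcrossRecords` to `flatMapGroupsWithState`. Note that processing-time timeouts are only evaluated when a trigger actually runs, so a key may linger somewhat longer than 2 hours if no new data arrives on the stream.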
The problem now is:
    if (timeDuration >= 2) {
      println("removing state for " + iprow.rowkey)
      oldState.remove()
    }
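For reference, the `timeDuration` tested in this snippet is computed with integer division, so it counts whole hours only. A small sketch of that arithmetic, with hypothetical names `ElapsedHours` and `wholeHours` and made-up timestamps:

```scala
object ElapsedHours {
  // Mirrors the expression from the question: seconds -> minutes -> hours, all integer division.
  def wholeHours(nowSec: Long, insertSec: Long): Long = ((nowSec - insertSec) / 60) / 60

  def main(args: Array[String]): Unit = {
    val insertedAt = 1000000L
    println(wholeHours(insertedAt + 7199, insertedAt)) // 1 (one second short of 2 hours)
    println(wholeHours(insertedAt + 7200, insertedAt)) // 2
  }
}
```

Because the check lives in the per-row branch for unchanged values, it is only evaluated when a duplicate record for that key actually arrives.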