Apache Spark: arbitrary stateful streaming aggregation with the flatMapGroupsWithState API
Tags: apache-spark, spark-structured-streaming, spark-streaming-kafka
I am a 10-day-old Spark developer trying to understand Spark's flatMapGroupsWithState API.
As I understand it:
We pass it two options. One is the timeout configuration; a possible value is GroupStateTimeout.ProcessingTimeTimeout, which instructs Spark to consider processing time rather than event time. The other is the output mode.
We pass in a function, say myFunction, which is responsible for setting the state for each key. We also set a timeout duration with groupState.setTimeoutDuration(TimeUnit.HOURS.toMillis(4)), assuming groupState is my GroupState instance for a key.
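To make the description above concrete, here is a minimal sketch of how such a call might look. The types Event, KeyState and Summary, the object name, and the body of myFunction are hypothetical and not taken from the question; only flatMapGroupsWithState, GroupState, GroupStateTimeout and OutputMode are Spark's own names. It assumes Spark SQL is on the classpath.

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

object FlatMapGroupsWithStateSketch {
  // Hypothetical input, state and output types (not from the question).
  case class Event(key: String, value: String)
  case class KeyState(values: Seq[String])
  case class Summary(key: String, valueCount: Int, timedOut: Boolean)

  // Called once per key that has new data, and once per key whose timeout fired.
  def myFunction(
      key: String,
      events: Iterator[Event],
      state: GroupState[KeyState]): Iterator[Summary] =
    if (state.hasTimedOut) {
      // No new events for this key: emit a final summary and drop the state.
      val count = state.getOption.map(_.values.size).getOrElse(0)
      state.remove()
      Iterator(Summary(key, count, timedOut = true))
    } else {
      // Append the new values to whatever was stored for this key.
      val previous = state.getOption.map(_.values).getOrElse(Seq.empty[String])
      val updated = KeyState(previous ++ events.map(_.value))
      state.update(updated)
      // Re-arm the processing-time timeout, as in the question.
      state.setTimeoutDuration(TimeUnit.HOURS.toMillis(4))
      Iterator(Summary(key, updated.values.size, timedOut = false))
    }

  // The two options from the question: the output mode and the timeout configuration.
  def aggregate(spark: SparkSession, events: Dataset[Event]): Dataset[Summary] = {
    import spark.implicits._
    events
      .groupByKey(_.key)
      .flatMapGroupsWithState(
        OutputMode.Update(),
        GroupStateTimeout.ProcessingTimeTimeout())(myFunction _)
  }
}
```

The same function can also be applied to a batch Dataset, where it is invoked once per key with empty state; that is a convenient way to try the logic without a streaming source.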
After a micro-batch of data, the intermediate state looks like this:
State for key1:
{
key1: [v1, v2, v3, v4, v5]
}
State for key2:
{
key2: [v11, v12, v13, v14, v15]
}
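For any new data that comes in, myFunction is called with the state of the specific key. For example, for key1, myFunction is called with key1, the new values for key1, and the existing state [v1, v2, v3, v4, v5], and it updates the state of key1 according to its logic.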
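The per-key call described above can be modelled without Spark. This is only a simplified sketch: the store is a plain Map, and PerKeyStateModel, processBatch and the append-only myFunction are illustrative names and logic, not Spark's implementation.

```scala
object PerKeyStateModel {
  type State = Vector[String]

  // Stand-in for myFunction: append the new values to the existing state.
  def myFunction(key: String, newValues: Iterator[String], state: Option[State]): State =
    state.getOrElse(Vector.empty) ++ newValues

  // One micro-batch: group rows by key and call myFunction once per key
  // that is present in the batch. Keys without new rows are left untouched.
  def processBatch(store: Map[String, State], batch: Seq[(String, String)]): Map[String, State] =
    batch.groupBy(_._1).foldLeft(store) { case (acc, (key, rows)) =>
      acc.updated(key, myFunction(key, rows.iterator.map(_._2), acc.get(key)))
    }
}
```

With the two states shown above as the store, a batch containing only rows for key1 extends the key1 list and leaves key2 as it was.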
I read about timeouts and found that the timeout indicates how long we should wait before timing out some intermediate state.
Questions:
[...] org.apache.spark.util.SystemClock class). You can check which clock is currently in use by inspecting the triggerClock parameter of org.apache.spark.sql.streaming.StreamingQueryManager#startQuery.
You will find more details in the FlatMapGroupsWithStateExec class, especially here:
// Generate a iterator that returns the rows grouped by the grouping function
// Note that this code ensures that the filtering for timeout occurs only after
// all the data has been processed. This is to ensure that the timeout information of all
// the keys with data is updated before they are processed for timeouts.
val outputIterator =
  processor.processNewData(filteredIter) ++ processor.processTimedOutState()
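If you analyze these two methods, you will see that:
processNewData applies the mapping function to all active keys (those present in the micro-batch)
processTimedOutState calls the mapping function for all expired states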
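The ordering guarantee mentioned in the source comment relies on Scala's Iterator `++` taking its right-hand side by name, so the timeout scan is only evaluated once the new-data iterator is exhausted. The Spark-free sketch below illustrates this; TimeoutOrderingModel, its Map-based store and the use of -1 as the "no timestamp" marker are assumptions made for the example.

```scala
import scala.collection.mutable

object TimeoutOrderingModel {
  val NoTimestamp: Long = -1L // assumed sentinel for "no timeout set"

  // timeouts: key -> timeout timestamp. Returns the calls in the order they happen.
  def run(
      timeouts: mutable.Map[String, Long],
      batchKeys: Seq[String],
      batchTimeMs: Long,
      durationMs: Long): List[String] = {

    // Lazily processes keys with new data and refreshes their timeout.
    def processNewData(): Iterator[String] =
      batchKeys.iterator.map { key =>
        timeouts(key) = batchTimeMs + durationMs
        s"data:$key"
      }

    // Scans the store for expired keys at the moment it is invoked.
    def processTimedOutState(): Iterator[String] =
      timeouts.toSeq.sortBy(_._1).iterator.collect {
        case (key, ts) if ts != NoTimestamp && ts < batchTimeMs => s"timeout:$key"
      }

    // `++` evaluates processTimedOutState() only after processNewData() is drained,
    // so keys that received data have already had their timeout refreshed.
    (processNewData() ++ processTimedOutState()).toList
  }
}
```

For example, if key1 and key2 both have a stale timeout and the batch contains only key1, key1 is processed as new data and only key2 is reported as timed out.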
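Comment: When is the processed data written to the sink, and where is that configured?
Comment (@Bartosz25): It depends on the output mode. You can find the semantics of all output modes for flatMapGroupsWithState here: [...]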
/**
 * For every group, get the key, values and corresponding state and call the function,
 * and return an iterator of rows
 */
def processNewData(dataIter: Iterator[InternalRow]): Iterator[InternalRow] = {
  val groupedIter = GroupedIterator(dataIter, groupingAttributes, child.output)
  groupedIter.flatMap { case (keyRow, valueRowIter) =>
    val keyUnsafeRow = keyRow.asInstanceOf[UnsafeRow]
    callFunctionAndUpdateState(
      stateManager.getState(store, keyUnsafeRow),
      valueRowIter,
      hasTimedOut = false)
  }
}
def processTimedOutState(): Iterator[InternalRow] = {
  if (isTimeoutEnabled) {
    val timeoutThreshold = timeoutConf match {
      case ProcessingTimeTimeout => batchTimestampMs.get
      case EventTimeTimeout => eventTimeWatermark.get
      case _ =>
        throw new IllegalStateException(
          s"Cannot filter timed out keys for $timeoutConf")
    }
    val timingOutPairs = stateManager.getAllState(store).filter { state =>
      state.timeoutTimestamp != NO_TIMESTAMP && state.timeoutTimestamp < timeoutThreshold
    }
    timingOutPairs.flatMap { stateData =>
      callFunctionAndUpdateState(stateData, Iterator.empty, hasTimedOut = true)
    }
  } else Iterator.empty
}
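The threshold selection and filter in processTimedOutState can be tried in isolation. The sketch below is a simplified, Spark-free model: the TimeoutConf hierarchy, StateData and the -1 sentinel mirror the names in the excerpt but are redefined here as assumptions for the example.

```scala
object TimeoutThresholdModel {
  sealed trait TimeoutConf
  case object ProcessingTimeTimeout extends TimeoutConf
  case object EventTimeTimeout extends TimeoutConf
  case object NoTimeout extends TimeoutConf

  val NoTimestamp: Long = -1L // assumed sentinel for "no timeout set"

  final case class StateData(key: String, timeoutTimestamp: Long)

  // Processing-time timeouts compare against the batch timestamp,
  // event-time timeouts against the current watermark.
  def threshold(conf: TimeoutConf, batchTimestampMs: Long, eventTimeWatermarkMs: Long): Long =
    conf match {
      case ProcessingTimeTimeout => batchTimestampMs
      case EventTimeTimeout      => eventTimeWatermarkMs
      case NoTimeout =>
        throw new IllegalStateException(s"Cannot filter timed out keys for $conf")
    }

  // A state expires only if it has a timeout set and that timeout is
  // strictly earlier than the threshold.
  def timedOut(states: Seq[StateData], thresholdMs: Long): Seq[StateData] =
    states.filter(s => s.timeoutTimestamp != NoTimestamp && s.timeoutTimestamp < thresholdMs)
}
```

Because the comparison is strict, a state whose timeout equals the threshold is not yet expired; it is picked up in a later micro-batch, which is why a timeout fires only when a batch is actually triggered.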
def callFunctionAndUpdateState(...)
  // ...
  // When the iterator is consumed, then write changes to state
  def onIteratorCompletion: Unit = {
    if (groupState.hasRemoved && groupState.getTimeoutTimestamp == NO_TIMESTAMP) {
      stateManager.removeState(store, stateData.keyRow)
      numUpdatedStateRows += 1
    } else {
      val currentTimeoutTimestamp = groupState.getTimeoutTimestamp
      val hasTimeoutChanged = currentTimeoutTimestamp != stateData.timeoutTimestamp
      val shouldWriteState = groupState.hasUpdated || groupState.hasRemoved || hasTimeoutChanged
      if (shouldWriteState) {
        val updatedStateObj = if (groupState.exists) groupState.get else null
        stateManager.putState(store, stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
        numUpdatedStateRows += 1
      }
    }
  }
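The branching in onIteratorCompletion boils down to a three-way decision: remove the row, write it, or leave the store alone. The following Spark-free sketch restates that decision as a pure function; StateWriteDecisionModel, the Action type and the -1 sentinel are illustrative assumptions, not Spark's API.

```scala
object StateWriteDecisionModel {
  sealed trait Action
  case object Remove extends Action // delete the key from the state store
  case object Put extends Action    // write state and/or timeout for the key
  case object Skip extends Action   // nothing changed, no write

  val NoTimestamp: Long = -1L // assumed sentinel for "no timeout set"

  def decide(
      hasUpdated: Boolean,
      hasRemoved: Boolean,
      newTimeout: Long,
      storedTimeout: Long): Action =
    if (hasRemoved && newTimeout == NoTimestamp) Remove
    else if (hasUpdated || hasRemoved || newTimeout != storedTimeout) Put
    else Skip
}
```

Note the second case in particular: a state that was removed but still carries a timeout is written back (with a null state object in the excerpt) rather than deleted, and a changed timeout alone is enough to trigger a write.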