Spark arbitrary stateful stream aggregation, flatMapGroupsWithState API


I have been a Spark developer for 10 days and am trying to understand Spark's flatMapGroupsWithState API.

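As I understand it:

We pass two options to it. One is the timeout configuration; a possible value is GroupStateTimeout.ProcessingTimeTimeout, which instructs Spark to consider processing time rather than event time. The other is the output mode.

We pass in a function, say myFunction, that is responsible for setting the state for each key. We also set a timeout duration with groupState.setTimeoutDuration(TimeUnit.HOURS.toMillis(4)), assuming groupState is the GroupState instance for a key.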
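The call described above might look roughly like the following sketch. Event, KeyState, StatefulSketch, merge and aggregate are hypothetical names introduced for illustration, the input is assumed to be a streaming Dataset[Event], and OutputMode.Update() is an assumed choice (Append is the other mode flatMapGroupsWithState accepts).

```scala
import java.util.concurrent.TimeUnit

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

// Hypothetical input record and per-key state, introduced only for illustration.
case class Event(key: String, value: String)
case class KeyState(values: Seq[String])

object StatefulSketch {
  // Pure helper: how the stored values for a key are merged with new ones.
  def merge(previous: Seq[String], incoming: Seq[String]): Seq[String] =
    previous ++ incoming

  // Plays the role of myFunction. It is called once per key that has data in the
  // micro-batch, or once with no values when the key's state has timed out.
  def myFunction(
      key: String,
      events: Iterator[Event],
      state: GroupState[KeyState]): Iterator[(String, Int)] = {
    if (state.hasTimedOut) {
      // Emit a final count and drop the state explicitly.
      val finalCount = state.getOption.map(_.values.size).getOrElse(0)
      state.remove()
      Iterator((key, finalCount))
    } else {
      val previous = state.getOption.map(_.values).getOrElse(Seq.empty)
      val updated = KeyState(merge(previous, events.map(_.value).toList))
      state.update(updated)
      // Set the timeout again on every call; it is not carried over automatically.
      state.setTimeoutDuration(TimeUnit.HOURS.toMillis(4))
      Iterator((key, updated.values.size))
    }
  }

  // Wires the function into a grouped streaming Dataset.
  def aggregate(spark: SparkSession, events: Dataset[Event]): Dataset[(String, Int)] = {
    import spark.implicits._
    events
      .groupByKey(_.key)
      .flatMapGroupsWithState[KeyState, (String, Int)](
        OutputMode.Update(), GroupStateTimeout.ProcessingTimeTimeout)(myFunction)
  }
}
```

The timed-out branch is where unbounded state growth is avoided in this sketch: the state for a key disappears only because the function calls state.remove().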

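As I understand it, as micro-batches of stream data keep arriving, Spark maintains the intermediate state that we define in the user-defined function. Suppose the intermediate state after processing a number of micro-batches is as follows: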

State for key1:

{
  key1: [v1, v2, v3, v4, v5]
}

State for key2:

{
   key2: [v11, v12, v13, v14, v15]
}

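For any new data that arrives, myFunction is called with the state for that particular key. For example, for key1, myFunction is called with key1, the new key1 values, and the existing state [v1, v2, v3, v4, v5], and it updates the key1 state according to its logic.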

I read about the timeout and found that it dictates how long Spark should wait before timing out some intermediate state.

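Questions:

If this process runs indefinitely, my intermediate states will keep piling up and eventually hit the memory limits on the nodes. So when are these intermediate states cleared? I found that in the case of event-time aggregation, watermarks dictate when the intermediate states are cleared.

What does timing out the intermediate state mean in the context of processing time?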

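Apache Spark marks the state as expired once the timeout has elapsed, so in your example after 4 hours of inactivity (real time + 4 hours, where inactivity means that no new event has updated the state). Expired state is not dropped silently: the function is called once more for that key with no values and hasTimedOut set to true, and the state is deleted from the store when the function calls remove() on it.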

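What does timing out the intermediate state mean in the context of processing time?

It means that the state times out according to the real clock (processing time, the org.apache.spark.util.SystemClock class). You can check which clock is currently in use by looking at the triggerClock parameter of org.apache.spark.sql.streaming.StreamingQueryManager#startQuery.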
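The arithmetic behind a processing-time timeout can be sketched in plain Scala. This is a simplified model, not Spark's code: it assumes that setting a timeout duration stores a deadline equal to the batch timestamp plus the duration, and that a later micro-batch compares that deadline with its own timestamp, as the processTimedOutState excerpt in this answer does. ProcessingTimeTimeoutSketch, deadline, hasExpired and the value of NoTimestamp are hypothetical names and values.

```scala
import java.util.concurrent.TimeUnit

object ProcessingTimeTimeoutSketch {
  // Sentinel meaning "no timeout registered"; stands in for NO_TIMESTAMP.
  val NoTimestamp: Long = -1L

  // Deadline stored with the state when the function sets a timeout duration
  // while processing a batch that started at batchTimestampMs.
  def deadline(batchTimestampMs: Long, durationMs: Long): Long =
    batchTimestampMs + durationMs

  // A key counts as timed out once a later batch starts strictly after the deadline.
  def hasExpired(timeoutTimestamp: Long, currentBatchTimestampMs: Long): Boolean =
    timeoutTimestamp != NoTimestamp && timeoutTimestamp < currentBatchTimestampMs

  def main(args: Array[String]): Unit = {
    val fourHours = TimeUnit.HOURS.toMillis(4)
    val d = deadline(batchTimestampMs = 0L, durationMs = fourHours)
    println(hasExpired(d, TimeUnit.HOURS.toMillis(3))) // false: still within 4 hours
    println(hasExpired(d, fourHours + 1))              // true: a batch ran after the deadline
  }
}
```

Because the comparison happens only when a micro-batch runs, the 4 hours act as a minimum period of inactivity rather than an exact alarm.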

You will find more details in the FlatMapGroupsWithStateExec class, in particular here:

// Generate a iterator that returns the rows grouped by the grouping function
// Note that this code ensures that the filtering for timeout occurs only after
// all the data has been processed. This is to ensure that the timeout information of all
// the keys with data is updated before they are processed for timeouts.
val outputIterator =
  processor.processNewData(filteredIter) ++ processor.processTimedOutState()

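If you analyze these two methods, you will see that:

processNewData applies the mapping function to all active keys (those present in the micro-batch)

processTimedOutState calls the mapping function on all expired states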

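When is the processed data written to the sink, and where is that configured? @Bartosz25

It depends on the output mode. The output mode semantics for flatMapGroupsWithState are described in the Spark documentation.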

    /**
     * For every group, get the key, values and corresponding state and call the function,
     * and return an iterator of rows
     */
    def processNewData(dataIter: Iterator[InternalRow]): Iterator[InternalRow] = {
      val groupedIter = GroupedIterator(dataIter, groupingAttributes, child.output)
      groupedIter.flatMap { case (keyRow, valueRowIter) =>
        val keyUnsafeRow = keyRow.asInstanceOf[UnsafeRow]
        callFunctionAndUpdateState(
          stateManager.getState(store, keyUnsafeRow),
          valueRowIter,
          hasTimedOut = false)
      }
    }

    def processTimedOutState(): Iterator[InternalRow] = {
      if (isTimeoutEnabled) {
        val timeoutThreshold = timeoutConf match {
          case ProcessingTimeTimeout => batchTimestampMs.get
          case EventTimeTimeout => eventTimeWatermark.get
          case _ =>
            throw new IllegalStateException(
              s"Cannot filter timed out keys for $timeoutConf")
        }
        val timingOutPairs = stateManager.getAllState(store).filter { state =>
          state.timeoutTimestamp != NO_TIMESTAMP && state.timeoutTimestamp < timeoutThreshold
        }
        timingOutPairs.flatMap { stateData =>
          callFunctionAndUpdateState(stateData, Iterator.empty, hasTimedOut = true)
        }
      } else Iterator.empty
    }

    def callFunctionAndUpdateState(...): Iterator[InternalRow] = {
      // ...
      // When the iterator is consumed, then write changes to state
      def onIteratorCompletion: Unit = {
        if (groupState.hasRemoved && groupState.getTimeoutTimestamp == NO_TIMESTAMP) {
          stateManager.removeState(store, stateData.keyRow)
          numUpdatedStateRows += 1
        } else {
          val currentTimeoutTimestamp = groupState.getTimeoutTimestamp
          val hasTimeoutChanged = currentTimeoutTimestamp != stateData.timeoutTimestamp
          val shouldWriteState = groupState.hasUpdated || groupState.hasRemoved || hasTimeoutChanged

          if (shouldWriteState) {
            val updatedStateObj = if (groupState.exists) groupState.get else null
            stateManager.putState(store, stateData.keyRow, updatedStateObj, currentTimeoutTimestamp)
            numUpdatedStateRows += 1
          }
        }
      }
      // ...
    }
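To see how these pieces fit together, here is a small plain-Scala model of one micro-batch. It is a simplified sketch and not Spark's implementation: MicroBatchSketch, Stored, Decision, Keep, Remove, UserFn and runBatch are hypothetical names, the state store is an immutable Map, and Keep is assumed to always register a new timeout.

```scala
object MicroBatchSketch {
  val NoTimestamp: Long = -1L

  // Hypothetical stand-in for a state store row: the values plus the timeout deadline.
  final case class Stored(values: List[String], timeoutTimestamp: Long)

  // What the user function decides for a key: keep it with new values, or remove it.
  sealed trait Decision
  final case class Keep(values: List[String], timeoutDurationMs: Long) extends Decision
  case object Remove extends Decision

  // (key, new values, existing state, hasTimedOut) => decision
  type UserFn = (String, List[String], Option[List[String]], Boolean) => Decision

  def runBatch(
      store: Map[String, Stored],
      newData: Map[String, List[String]],
      batchTimestampMs: Long,
      fn: UserFn): Map[String, Stored] = {
    def applyDecision(s: Map[String, Stored], key: String, d: Decision): Map[String, Stored] =
      d match {
        case Remove => s - key
        case Keep(values, durationMs) =>
          s + (key -> Stored(values, batchTimestampMs + durationMs))
      }

    // Step 1: call the function for every key that has data in this batch.
    val afterData = newData.foldLeft(store) { case (s, (key, values)) =>
      applyDecision(s, key, fn(key, values, s.get(key).map(_.values), false))
    }

    // Step 2: only then call it, without values, for keys whose deadline has passed.
    val timedOut = afterData.filter { case (_, st) =>
      st.timeoutTimestamp != NoTimestamp && st.timeoutTimestamp < batchTimestampMs
    }
    timedOut.foldLeft(afterData) { case (s, (key, st)) =>
      applyDecision(s, key, fn(key, Nil, Some(st.values), true))
    }
  }
}
```

With a function that returns Remove when hasTimedOut is true and Keep otherwise, a key that receives data in the batch has its deadline refreshed in step 1 and so survives step 2, while an idle key past its deadline is handed to the function once more and then dropped.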