Kafka Streams aggregation with suppress: how the cache size affects the aggregation results

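I am using Kafka Streams to build an aggregation (sum) over a 30-minute sliding window with a grace period of 2 minutes. I am processing 10,000 time series (groups). The aggregation uses a persistent state store with logging disabled. To emit only the final result at the end of each aggregation interval, I use the suppress operator.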

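For most time series the aggregate is computed correctly, but for a small fraction of them it is not. In those cases the aggregated value often reflects exactly one record value from the input stream. Initially I ran the aggregation with the default record cache enabled (cache.max.bytes.buffering = 10 MB) and suppress with the withNoBound() option, so that the buffer keeps allocating more memory as needed.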

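Following this recommendation:

The suppression buffer memory is independent of the Streams record cache, so be sure you have enough heap to host the record cache (cache.max.bytes.buffering) in addition to the sum of all the suppression buffer sizes.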

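I increased the record cache size (cache.max.bytes.buffering) from 10 MB (the default) to 100 MB, and the accuracy of the results improved significantly. However, from time to time I still find groups whose sum is computed incorrectly.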

My aggregation pipeline:

@StreamListener
@SendTo("output-aggregated")
public KStream<String, Aggregate> aggregatePipeline(
        @Input("input-event") KStream<String, Event> inputEventKStream) {

    Duration windowDuration = Duration.ofMinutes(30);
    Duration retentionPeriod = windowDuration;
    Duration advanceDuration = Duration.ofMinutes(1);
    Duration graceDuration = Duration.ofMinutes(2);

    // custom state store
    WindowBytesStoreSupplier timestampedWindowStore = Stores.persistentTimestampedWindowStore("aggregate-30m",
            retentionPeriod, windowDuration, true);
    Materialized<String, Aggregate, WindowStore<Bytes,byte[]>> materializedCustomStore = Materialized.as(timestampedWindowStore);
    materializedCustomStore.withKeySerde(Serdes.String()).withLoggingDisabled();

    // stream processing
    TimeWindowedKStream<String, Event> timeWindowedKStream = inputEventKStream
            .filter((key, value) -> isEventMatchingAggregationInterval(value, windowDuration))
            .groupBy((key, value) -> Utils.toTimeserieName(value.getSource(), value.getDimensions()))
            .windowedBy(TimeWindows.of(windowDuration).advanceBy(advanceDuration).grace(graceDuration));

    KTable<Windowed<String>, Aggregate> aggregatedKTable = timeWindowedKStream
            .aggregate(
                    () -> new Aggregate(),
                    (key, newValue, aggregate) -> {
                        aggregate.setSum(newValue.getValue() + aggregate.getSum());
                        aggregate.setCount(aggregate.getCount() + 1);
                        return aggregate;
                    },
                    materializedCustomStore);

    return aggregatedKTable
            .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded().withNoBound()).withName("suppressed-30m"))
            .toStream()
            .map((key, value) -> new KeyValue<String, Aggregate>(key.key(), value))
    ;
}

My environment: Kafka Streams 2.5.0, Spring Cloud Hoxton.SR8, Spring Boot 2.3.2

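My questions are:

How do I size the record cache so that all aggregates are computed correctly?

Why does the cache size affect the computation of the final result? Since suppress is a downstream operator that follows the aggregation, shouldn't it see all intermediate results for a given window, given that those results are flushed whenever the cache is full?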

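From the documentation:

The semantics of caching is that data is flushed to the state store and forwarded to the next downstream processor node whenever the earliest of commit.interval.ms or cache.max.bytes.buffering (cache pressure) hits. Both commit.interval.ms and cache.max.bytes.buffering are global parameters.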

Is there an option to emit some logging when the cache is fully utilized and a flush is triggered?

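Which metrics do you recommend for tracking the utilization of the record cache and the suppression buffer? Is there anything besides "kafka_stream_state_suppression_buffer_size" and "kafka_stream_record_cache_hit_ratio"?

Many thanks.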

The problem with the approach above lies in the following lines of code:

 WindowBytesStoreSupplier timestampedWindowStore = Stores.persistentTimestampedWindowStore("aggregate-30m",
     retentionPeriod, windowDuration, true); // last argument: retainDuplicates = true

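The flag set to true is the retainDuplicates option of the state store. Because it is true, the store keeps duplicate entries for the same key. As soon as the record cache fills up, intermediate aggregation results are pushed to the store. Since several intermediate results are then retained for the same key, subsequent aggregation steps for that key can no longer be computed correctly.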

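In other words, disabling the retainDuplicates option solves the problem, even with a record cache size of zero. The final result of the aggregation is then no longer affected by the record cache size.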

    WindowBytesStoreSupplier timestampedWindowStore = Stores.persistentTimestampedWindowStore("aggregate-30m",
        retentionPeriod, windowDuration, false);
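
To illustrate why retained duplicates break a read-modify-write aggregation, here is a minimal plain-Java sketch. It is a toy model and not Kafka's actual store implementation: ToyStore, sumPerKey and the "#" key suffix are hypothetical, and the model assumes that a duplicate-retaining store tags every write with a fresh store-wide sequence number while a point lookup only probes the current one. No record cache is modeled, so every update goes straight to the store.

```java
import java.util.HashMap;
import java.util.Map;

class RetainDuplicatesSketch {

    // Toy stand-in for a window store. ASSUMPTION: with retainDuplicates=true each
    // write gets a new store-wide sequence number appended to its key, and a lookup
    // only checks the key combined with the current sequence number.
    static final class ToyStore {
        private final boolean retainDuplicates;
        private final Map<String, Long> entries = new HashMap<>();
        private int seqnum = 0;

        ToyStore(boolean retainDuplicates) {
            this.retainDuplicates = retainDuplicates;
        }

        void put(String key, long value) {
            if (retainDuplicates) {
                seqnum++; // never overwrite: every write becomes a new entry
            }
            entries.put(key + "#" + seqnum, value);
        }

        Long get(String key) {
            return entries.get(key + "#" + seqnum);
        }
    }

    // Read-modify-write sum per key, written directly to the store (no cache).
    static Map<String, Long> sumPerKey(boolean retainDuplicates, String[] keys, long[] values) {
        ToyStore store = new ToyStore(retainDuplicates);
        Map<String, Long> latest = new HashMap<>();
        for (int i = 0; i < keys.length; i++) {
            Long previous = store.get(keys[i]);
            // A failed lookup restarts the sum from the initial value 0.
            long updated = (previous == null ? 0L : previous) + values[i];
            store.put(keys[i], updated);
            latest.put(keys[i], updated);
        }
        return latest;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "a", "b"};
        long[] values = {1, 10, 2, 20};
        System.out.println("retainDuplicates=false: " + sumPerKey(false, keys, values));
        System.out.println("retainDuplicates=true:  " + sumPerKey(true, keys, values));
    }
}
```

Under these assumptions, interleaved updates for keys a and b give the sums 3 and 30 without duplicates, but 2 and 20 with duplicates retained: each "sum" is just the last record, which resembles the symptom described in the question. A record cache in front of the store would hide the effect until entries are evicted, which is one plausible reading of why a larger cache reduced the number of wrong results.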

So the problem was not the suppress operator, but a misconfigured state store used by the aggregation operator.
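
One way to check that the result no longer depends on the record cache is to run the pipeline once with the cache disabled. The sketch below only builds the two settings named in the question as plain properties. The class name CacheConfigSketch is hypothetical and the commit interval value is illustrative. With the Spring Cloud Stream Kafka Streams binder, such settings are typically passed under spring.cloud.stream.kafka.streams.binder.configuration.*, which is worth verifying against the binder version in use.

```java
import java.util.Properties;

class CacheConfigSketch {

    // Builds Kafka Streams settings with the record cache turned off.
    static Properties withoutRecordCache() {
        Properties props = new Properties();
        // 0 bytes disables the record cache, so every update is written through
        // to the state store and forwarded downstream immediately.
        props.setProperty("cache.max.bytes.buffering", "0");
        // Illustrative value: commit (and flush) every 30 seconds.
        props.setProperty("commit.interval.ms", "30000");
        return props;
    }

    public static void main(String[] args) {
        withoutRecordCache().forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

If the aggregates are still correct with the cache at zero, the cache size is only a performance setting and not a correctness one.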

As for logging: debug-level logs for the package org.apache.kafka.streams.processor.internals provide information about commits.
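
Since the question mentions Spring Boot, the debug level for that package could be enabled with a configuration fragment like the following, assuming the application uses Spring Boot's standard logging.level.* properties in application.properties:

```properties
# Assumes Spring Boot's default logging setup; adjust if a custom logging config is used.
logging.level.org.apache.kafka.streams.processor.internals=DEBUG
```

This is verbose, so it is probably best enabled only while investigating.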