Apache Kafka: generating statistics events with Kafka Streams and windowed streams, but skipping some windowed events when a newer one exists

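For the events of the last 24 hours, I'd like to generate a statistics event and display it in real time. The current behavior is:

Every minute, I aggregate the events of the last 24 hours into a list object

I compute over the full temporary list and generate a final statistics object

I push the latest statistics object to a new topic

I'm using Kafka Streams and Spring Boot

It works fine: the computation is correct and the events were produced as expected during development. The problem appears in production, where the source event topic contains too much data

Suppose my application is stopped for a day, or even for a few minutes. When it is restarted, it tries to recover the history: Kafka Streams continues processing from the last committed offset, and it takes a long time to catch up. In fact, I don't care about the history. I don't need the statistics object for yesterday, or for "the last 24 hours minus 1 hour". I just want to recompute the data from now back to 24 hours ago, and that's all

The same applies when the application is running correctly but processes the statistics events with some delay: the lag keeps growing. If the delay becomes too large, I would like to skip time windows automatically and compute only the latest one

Do you think this is possible with Kafka Streams? Thanks in advance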

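// Imports needed by the bean method below. KafkaUtils, MySourceEvent,
// TemporaryStatistiqueEvent and FinalStatisticEvent are the application's own classes.
import java.time.Duration;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.KeyValueMapper;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.Suppressed.BufferConfig;
import org.apache.kafka.streams.kstream.TimeWindows;
import org.apache.kafka.streams.kstream.Windowed;
import org.apache.kafka.streams.state.WindowStore;
import org.springframework.context.annotation.Bean;

/**
 * Every minute, we collect all events of the last day and publish a new statistic event.
 *
 * @param streamsBuilder the builder used to define the topology
 * @return the source stream of MySourceEvent records
 */
@Bean
public KStream<String, MySourceEvent> kstreamMySourceEventStatistique(final StreamsBuilder streamsBuilder) {

    // We create the stream that consumes the source topic.
    KStream<String, MySourceEvent> kstreamStat = streamsBuilder
            .<String, MySourceEvent>stream("my-source-topic", Consumed
                    .with(Serdes.String(), KafkaUtils
                            .jsonSerdeForClass(MySourceEvent.class)));

    // For this stream, every minute, we take all events of the last 24h and aggregate them into a TemporaryStatistiqueEvent.
    // The hopping window is 1 day long and advances by 1 minute; with a grace period of 0, late records are dropped.
    KTable<Windowed<String>, TemporaryStatistiqueEvent> aggregatedStream = kstreamStat
            .groupByKey(Grouped
                    .with(Serdes.String(), KafkaUtils
                            .jsonSerdeForClass(MySourceEvent.class)))
            .windowedBy(TimeWindows
                    .of(Duration.ofDays(1))
                    .advanceBy(Duration.ofMinutes(1))
                    .grace(Duration.ofSeconds(0)))
            .<TemporaryStatistiqueEvent>aggregate(() -> new TemporaryStatistiqueEvent(), (key, value, logAgg) -> {
                logAgg.add(value); // I add the event to my TemporaryStatistiqueEvent object
                return logAgg;
            }, Materialized
                    .<String, TemporaryStatistiqueEvent, WindowStore<Bytes, byte[]>>as("temporary-stats-store")
                    .withKeySerde(Serdes.String())
                    .withValueSerde(KafkaUtils.jsonSerdeForClass(TemporaryStatistiqueEvent.class))
                    .withRetention(Duration.ofDays(1)))
            // Emit only one final result per window, once the window has closed.
            .suppress(Suppressed.untilWindowCloses(BufferConfig.unbounded()));

    // Now that we have an aggregate over the last 24h, we compute the statistic and push a FinalStatisticEvent to a new topic.
    aggregatedStream
            .toStream()
            .map(new KeyValueMapper<Windowed<String>, TemporaryStatistiqueEvent, KeyValue<String, FinalStatisticEvent>>() {
                @Override
                public KeyValue<String, FinalStatisticEvent> apply(final Windowed<String> key, final TemporaryStatistiqueEvent temporaryStatistiqueEvent) {
                    // The statistic is stamped with the end of the window, in UTC.
                    ZonedDateTime zdt = ZonedDateTime.ofInstant(Instant.ofEpochMilli(key.window().end()), ZoneOffset.UTC);
                    return new KeyValue<>(key.key(), temporaryStatistiqueEvent.computeFinalStatisticEvent(zdt));
                }
            })
            .to("final-stat-topic", Produced.with(Serdes.String(), KafkaUtils.jsonSerdeForClass(FinalStatisticEvent.class)));

    return kstreamStat;
}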
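
A note on why catching up is so costly with this window definition: a hopping window of 1 day that advances by 1 minute places every record in many overlapping windows, so each incoming event updates many aggregates. The following is a minimal, Kafka-free sketch of that arithmetic. The class and method names are hypothetical, and the formula assumes window starts are aligned to multiples of the advance interval, which is how hopping windows are usually described; it is not the library's own code.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

class HoppingWindowSketch {

    // Returns the start timestamps of all windows [start, start + size) that contain timestampMs,
    // assuming window starts are aligned to multiples of the advance interval.
    static List<Long> windowStartsFor(long timestampMs, Duration size, Duration advance) {
        long sizeMs = size.toMillis();
        long advanceMs = advance.toMillis();
        List<Long> starts = new ArrayList<>();
        // Earliest aligned start whose window still covers the timestamp.
        long start = (Math.max(0L, timestampMs - sizeMs + advanceMs) / advanceMs) * advanceMs;
        while (start <= timestampMs) {
            starts.add(start);
            start += advanceMs;
        }
        return starts;
    }

    public static void main(String[] args) {
        List<Long> starts = windowStartsFor(
                1_700_000_000_000L, Duration.ofDays(1), Duration.ofMinutes(1));
        // One record falls into 1440 windows: 24 hours * 60 advances per hour.
        System.out.println("windows per record: " + starts.size());
    }
}
```

Under these assumptions a single record touches 1,440 window aggregates, which multiplies the work needed to replay a backlog.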

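This is a tricky problem.

For the case of going offline and restarting, you could try to manipulate the start offsets (i.e., the committed offsets) before you restart the application. Using bin/kafka-consumer-groups.sh it is possible to "seek by time" and thus "skip ahead to now minus 24h".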
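
To make "seek by time" concrete, here is a minimal, Kafka-free sketch of the idea: compute the cutoff "now minus 24h", then pick, per partition, the first offset whose record timestamp is at or after that cutoff. The class and method names are hypothetical and a partition is modeled as a plain list of timestamps. With the real tool, the reset mode accepts a timestamp or a duration (for example via --to-datetime or --by-duration); the application has to be stopped while the offsets are reset, and the exact options should be checked against your Kafka version.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

class SeekByTimeSketch {

    // Timestamp to seek to: "now minus lookback", in epoch milliseconds.
    static long targetTimestampMs(Instant now, Duration lookback) {
        return now.minus(lookback).toEpochMilli();
    }

    // Models one partition as a list of record timestamps indexed by offset.
    // Returns the first offset whose timestamp is >= targetMs, or -1 if there is none.
    static long firstOffsetAtOrAfter(List<Long> timestampsByOffset, long targetMs) {
        for (int offset = 0; offset < timestampsByOffset.size(); offset++) {
            if (timestampsByOffset.get(offset) >= targetMs) {
                return offset;
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2024-01-02T12:00:00Z");
        long target = targetTimestampMs(now, Duration.ofHours(24));
        List<Long> partition = List.of(
                Instant.parse("2023-12-31T08:00:00Z").toEpochMilli(),  // offset 0: too old
                Instant.parse("2024-01-01T11:59:00Z").toEpochMilli(),  // offset 1: too old
                Instant.parse("2024-01-01T12:00:00Z").toEpochMilli(),  // offset 2: exactly at the cutoff
                Instant.parse("2024-01-02T11:00:00Z").toEpochMilli()); // offset 3: recent
        // The application would resume from this offset instead of the old committed one.
        System.out.println("resume from offset: " + firstOffsetAtOrAfter(partition, target));
    }
}
```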

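For the case where the application is lagging, it is more complicated. Maybe you could add a "dynamic filter" as the first step in your program that drops records that are too old (to access record metadata such as the timestamp, you could use flatTransformValues to implement the filter).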
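
The decision such a filter has to make can be sketched without any Kafka dependency. The class and method names below are hypothetical. The sketch assumes the cutoff is measured against the current wall-clock time and that, in a real topology, the record timestamp would be read from the processor context inside the transformer passed to flatTransformValues. Returning an empty collection stands in for "emit nothing", which is how a flat-mapping step drops a record.

```java
import java.time.Duration;
import java.util.List;

class StaleRecordFilterSketch {

    // A record is fresh if its timestamp is no older than maxAge relative to nowMs.
    static boolean isFresh(long recordTimestampMs, long nowMs, Duration maxAge) {
        return recordTimestampMs >= nowMs - maxAge.toMillis();
    }

    // Flat-map style contract: zero output values drops the record, one value keeps it.
    static <V> List<V> keepIfFresh(V value, long recordTimestampMs, long nowMs, Duration maxAge) {
        return isFresh(recordTimestampMs, nowMs, maxAge) ? List.of(value) : List.of();
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        Duration maxAge = Duration.ofHours(24);

        long twoDaysAgo = now - Duration.ofDays(2).toMillis();
        long oneHourAgo = now - Duration.ofHours(1).toMillis();

        System.out.println("two days old -> " + keepIfFresh("old-event", twoDaysAgo, now, maxAge));
        System.out.println("one hour old -> " + keepIfFresh("recent-event", oneHourAgo, now, maxAge));
    }
}
```

Because the cutoff moves with the clock, a lagging application would discard the backlog that no longer matters and only aggregate records that can still contribute to the latest 24-hour window. Windows that lose part of their input this way would produce incomplete statistics, which matches the stated goal of only caring about the most recent window.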