Beam Java SDK 2.10.0 with Kafka source and Dataflow runner: windowed Count.perElement never fires data


java google-cloud-dataflow


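I am having a problem running a Beam SDK 2.10.0 job on Google Cloud Dataflow.

The flow is simple: I use Kafka as the source, apply fixed windows, and then count elements per key. However, data never seems to leave the counting stage until the job is drained. The element count of the output collection Count.PerElement/Combine.perKey(Count)/Combine.GroupedValues.out0 is always zero. Elements are emitted only after the Dataflow job is drained.

The code is as follows: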

public KafkaProcessingJob(BaseOptions options) {

    PCollection<GenericRecord> genericRecordPCollection = Pipeline.create(options)
                     .apply("Read binary Kafka messages", KafkaIO.<String, byte[]>read()
                           .withBootstrapServers(options.getBootstrapServers())
                           .updateConsumerProperties(configureConsumerProperties())
                           .withCreateTime(Duration.standardMinutes(1L))
                           .withTopics(inputTopics)
                           .withReadCommitted()
                           .commitOffsetsInFinalize()
                           .withKeyDeserializer(StringDeserializer.class)
                           .withValueDeserializer(ByteArrayDeserializer.class))

                    .apply("Map binary message to Avro GenericRecord", new DecodeBinaryKafkaMessage());

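    // Continue from the decoded records. The explicit <GenericRecord> type witness is
    // needed because Window.into(...) is the receiver of further chained calls.
    genericRecordPCollection
                    .apply("Apply windowing to records", Window.<GenericRecord>into(FixedWindows.of(Duration.standardMinutes(5)))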
                                       .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
                                       .discardingFiredPanes()
                                       .withAllowedLateness(Duration.standardMinutes(5)))

                    .apply("Write aggregated data to BigQuery", MapElements.into(TypeDescriptors.strings()).via(rec -> getKey(rec)))
                            .apply(Count.<String>perElement())
                            .apply(
                                new WriteWindowedToBigQuery<>(
                                    project,
                                    dataset,
                                    table,
                                    configureWindowedTableWrite()));   
}

private Map<String, Object> configureConsumerProperties() {
    Map<String, Object> configUpdates = Maps.newHashMap();
    configUpdates.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

    return configUpdates;
}

private static String getKey(GenericRecord record) {
    //extract key
}


It looks like the data never gets past

.apply(Count.<String>perElement())

Can anyone help?

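I found the cause.

It is related to the timestamp policy used here:

.withCreateTime(Duration.standardMinutes(1L))

Because the Kafka topic contains empty partitions, the topic watermark never advanced under the default timestamp policy.

I needed to implement a custom policy to solve this.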
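To illustrate the reasoning in this answer, here is a minimal, self-contained sketch in plain Java (no Beam dependency). It assumes the source watermark is the minimum of the per-partition watermarks, so one partition that never delivers a record can hold back every window. The class `EmptyPartitionWatermarkSketch`, its `PartitionState` type, the `NO_WATERMARK` stand-in, and the "idle-aware" rule are all hypothetical names and simplifications made for illustration. They are not Beam's API and not the policy the author ended up writing.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;

public class EmptyPartitionWatermarkSketch {

    /** Hypothetical per-partition state: the newest record timestamp seen so far. */
    public static final class PartitionState {
        Instant newestRecordTime; // null while the partition has never delivered a record

        public void onRecord(Instant timestamp) {
            if (newestRecordTime == null || timestamp.isAfter(newestRecordTime)) {
                newestRecordTime = timestamp;
            }
        }
    }

    /** Stand-in for "no watermark yet" (an assumption of this sketch). */
    public static final Instant NO_WATERMARK = Instant.EPOCH;

    /**
     * Watermark of a single partition. With idleAware == false only records move it,
     * so an empty partition stays at NO_WATERMARK. With idleAware == true an empty
     * partition follows the wall clock minus the allowed delay.
     */
    public static Instant partitionWatermark(
            PartitionState p, Duration maxDelay, Instant now, boolean idleAware) {
        if (p.newestRecordTime != null) {
            return p.newestRecordTime.minus(maxDelay);
        }
        return idleAware ? now.minus(maxDelay) : NO_WATERMARK;
    }

    /** Source watermark, modeled as the minimum over all partition watermarks. */
    public static Instant sourceWatermark(
            List<PartitionState> partitions, Duration maxDelay, Instant now, boolean idleAware) {
        Instant min = null;
        for (PartitionState p : partitions) {
            Instant w = partitionWatermark(p, maxDelay, now, idleAware);
            if (min == null || w.isBefore(min)) {
                min = w;
            }
        }
        return min == null ? NO_WATERMARK : min;
    }

    /** A watermark trigger can fire once the watermark has reached the end of the window. */
    public static boolean windowCanFire(Instant windowEnd, Instant watermark) {
        return !watermark.isBefore(windowEnd);
    }

    public static void main(String[] args) {
        Duration maxDelay = Duration.ofMinutes(1);
        Instant windowEnd = Instant.parse("2019-02-01T00:05:00Z");
        Instant now = Instant.parse("2019-02-01T00:08:00Z");

        PartitionState busy = new PartitionState();
        busy.onRecord(Instant.parse("2019-02-01T00:07:00Z"));
        PartitionState empty = new PartitionState(); // never receives a record
        List<PartitionState> partitions = Arrays.asList(busy, empty);

        Instant recordDriven = sourceWatermark(partitions, maxDelay, now, false);
        Instant idleAware = sourceWatermark(partitions, maxDelay, now, true);
        System.out.println("record-driven: fires=" + windowCanFire(windowEnd, recordDriven));
        System.out.println("idle-aware:    fires=" + windowCanFire(windowEnd, idleAware));
    }
}
```

In KafkaIO itself, a custom policy is supplied with `withTimestampPolicyFactory(...)` in place of `withCreateTime(...)`. How an idle or empty partition should advance its watermark is a design choice that depends on how much lateness the pipeline can tolerate.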
