Apache Beam: error when assigning event time using WithTimestamps


I have an unbounded Kafka stream that sends data with the following fields:

{"identifier": "xxx", "value": 10.0, "ts":"2019-01-16T10:51:26.326242+0000"}

I read the stream using the Apache Beam SDK's KafkaIO:

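import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

pipeline.apply(KafkaIO.<Long, String>read()
                    .withBootstrapServers("kafka:9092")
                    .withTopic("test")
                    .withKeyDeserializer(LongDeserializer.class)
                    .withValueDeserializer(StringDeserializer.class)
                    .updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
                    .updateConsumerProperties(ImmutableMap.of("group.id", "Consumer1"))
                    .commitOffsetsInFinalize()
                    // Drop the Kafka metadata and keep only the key/value pairs.
                    .withoutMetadata());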


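Because I want to window on event time (the "ts" field in my example), I parse the incoming string and assign the "ts" field of each incoming record as its timestamp: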

PCollection<Temperature> tempCollection = p.apply(new SetupKafka())
        .apply(ParDo.of(new ReadFromTopic()))
        .apply("ParseTemperature", ParDo.of(new ParseTemperature()));
tempCollection.apply("AssignTimeStamps", WithTimestamps.of(us -> new Instant(us.getTimestamp())));
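The ParseTemperature step itself is not shown in the question. As a rough sketch of how a "ts" string in the format above could be turned into epoch milliseconds (which is what a Joda `Instant` built from a `long` expects), something like the following might work. The class name `TsParser` and its methods are hypothetical, and it is an assumption that `getTimestamp()` returns epoch milliseconds. The sketch uses only `java.time`; note that the microsecond part is truncated to milliseconds on conversion.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the original pipeline.
final class TsParser {
    // Matches values such as 2019-01-16T10:51:26.326242+0000
    // (six fractional digits, offset without a colon).
    private static final DateTimeFormatter TS_FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSSSSZ");

    // Parses the "ts" field into a java.time.Instant (microsecond precision kept).
    static Instant parse(String ts) {
        return OffsetDateTime.parse(ts, TS_FORMAT).toInstant();
    }

    // Converts the "ts" field to epoch milliseconds; sub-millisecond digits are dropped.
    static long toEpochMillis(String ts) {
        return parse(ts).toEpochMilli();
    }

    public static void main(String[] args) {
        String ts = "2019-01-16T10:51:26.326242+0000";
        System.out.println(parse(ts));
        System.out.println(toEpochMillis(ts));
    }
}
```

The resulting `long` could then be stored on the parsed object and handed to whichever timestamp-assignment mechanism the pipeline ends up using.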

The window function and the computation are applied as follows:

PCollection<Output> output = tempCollection.apply(Window
        .<Temperature>into(FixedWindows.of(Duration.standardSeconds(30)))
        .triggering(AfterWatermark.pastEndOfWindow()
                .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardSeconds(10))))
        .withAllowedLateness(Duration.standardDays(1))
        .accumulatingFiredPanes())
        .apply(new ComputeMax());
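To illustrate why the element timestamp matters here: with 30-second fixed windows, each element lands in the window whose bounds contain its timestamp. The sketch below shows that idea in plain `java.time`, assuming windows aligned to the epoch with no offset. `FixedWindowSketch` is a hypothetical name and this is not Beam's implementation, only the arithmetic behind it.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of epoch-aligned fixed windows.
final class FixedWindowSketch {
    // Start (inclusive) of the fixed window that contains eventTime.
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ts = eventTime.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch too.
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    // End (exclusive) of the fixed window that contains eventTime.
    static Instant windowEnd(Instant eventTime, Duration size) {
        return windowStart(eventTime, size).plus(size);
    }

    public static void main(String[] args) {
        Instant eventTime = Instant.parse("2019-01-16T10:51:26.326Z");
        Duration size = Duration.ofSeconds(30);
        System.out.println(windowStart(eventTime, size) + " .. " + windowEnd(eventTime, size));
    }
}
```

If the elements carry their read time instead of the "ts" value, the same arithmetic places them in windows based on when they were read, not when the measurement was taken.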

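Because in real-world scenarios the event timestamp is usually earlier than the processing timestamp, I send the data to the input stream with a 5-second delay.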

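I get the following error:
Cannot output with timestamp 2019-01-16T11:15:45.560Z. Output timestamps must be no earlier than the timestamp of the current input (2019-01-16T11:16:50.640Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.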
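The rule stated in the error message can be written as a small check: an output timestamp is accepted only if it is not earlier than the input timestamp minus the allowed skew. The sketch below reproduces that comparison with the two timestamps from the message. `TimestampSkewSketch` is a hypothetical name and the code only mirrors the rule as worded in the error, not Beam's internals.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the rule quoted in the error message.
final class TimestampSkewSketch {
    // True if output is no earlier than (input - allowedSkew).
    static boolean isAllowed(Instant output, Instant input, Duration allowedSkew) {
        return !output.isBefore(input.minus(allowedSkew));
    }

    public static void main(String[] args) {
        // Values taken from the error message above.
        Instant output = Instant.parse("2019-01-16T11:15:45.560Z");
        Instant input = Instant.parse("2019-01-16T11:16:50.640Z");

        // How far the output timestamp lies behind the input timestamp.
        System.out.println(Duration.between(output, input));
        System.out.println(isAllowed(output, input, Duration.ZERO));
    }
}
```

Here the event time is about 65 seconds behind the timestamp already attached to the element, so with an allowed skew of 0 milliseconds the timestamp cannot be moved backwards at that point in the pipeline.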
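If I comment out the AssignTimeStamps line there is no error, but I assume the pipeline is then using processing time.
How can I make sure that my computation and windowing are based on event time rather than processing time?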

Please provide some input on how to handle this scenario.
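Have you had a chance to try this with a timestamp policy? Sorry, I have not tried it myself, but I believe that as of version 2.9.0 you should set such a policy when reading from Kafka with KafkaIO.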

To be able to use a custom timestamp, you first need to create a custom timestamp policy by extending TimestampPolicy. For example:

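import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

public class CustomFieldTimePolicy extends TimestampPolicy<String, Foo> {

    protected Instant currentWatermark;

    public CustomFieldTimePolicy(Optional<Instant> previousWatermark) {
        // Resume from the checkpointed watermark, or start from the minimum timestamp.
        currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
    }

    @Override
    public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, Foo> record) {
        // Use the event time carried in the record as its timestamp and as the watermark.
        currentWatermark = new Instant(record.getKV().getValue().getTimestamp());
        return currentWatermark;
    }

    @Override
    public Instant getWatermark(PartitionContext ctx) {
        return currentWatermark;
    }
}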
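The following line, added to the KafkaIO read, is responsible for creating a new TimestampPolicy, passing in the relevant partition and the previously checkpointed watermark:

.withTimestampPolicyFactory((tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))

Is it necessary to use this CustomFieldTimePolicy, or can we assign the timestamps with a map step after reading? What is the difference?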