Apache Beam: Error when assigning event time with WithTimestamps


I have an unbounded Kafka stream that sends records with the following fields:

{"identifier": "xxx", "value": 10.0, "ts":"2019-01-16T10:51:26.326242+0000"}
I read the stream with KafkaIO from the Apache Beam SDK:

import com.google.common.collect.ImmutableMap;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

pipeline.apply(KafkaIO.<Long, String>read()
                    .withBootstrapServers("kafka:9092")
                    .withTopic("test")
                    .withKeyDeserializer(LongDeserializer.class)
                    .withValueDeserializer(StringDeserializer.class)
                    .updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
                    .updateConsumerProperties(ImmutableMap.of("group.id", "Consumer1"))
                    .commitOffsetsInFinalize()
                    .withoutMetadata());
Because I want to window on event time ("ts" in my example), I parse the incoming string and assign the "ts" field of each record as its timestamp:

PCollection<Temperature> tempCollection = p.apply(new SetupKafka())
                    .apply(ParDo.of(new ReadFromTopic()))
                    .apply("ParseTemperature", ParDo.of(new ParseTemperature()));

tempCollection.apply("AssignTimeStamps", WithTimestamps.of(us -> new Instant(us.getTimestamp())));  
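
The parsing step is not shown above, so here is a minimal sketch of how the "ts" string from the sample record could be turned into epoch milliseconds. The sample value uses six fractional digits and an offset without a colon ("+0000"), so the sketch uses an explicit pattern rather than the default ISO parser. The class name TsFieldParser and the method toEpochMillis are hypothetical, and the sketch uses java.time rather than the Joda-Time types Beam works with.

```java
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the original pipeline.
final class TsFieldParser {
    // Six fractional digits and an offset without a colon, as in the sample record.
    private static final DateTimeFormatter TS_FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd'T'HH:mm:ss.SSSSSSZ");

    // Returns epoch milliseconds; the microsecond part is truncated.
    static long toEpochMillis(String ts) {
        return OffsetDateTime.parse(ts, TS_FORMAT).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        System.out.println(toEpochMillis("2019-01-16T10:51:26.326242+0000"));
    }
}
```

Assuming getTimestamp() on the Temperature class returns such an epoch-millisecond value, it is what the lambda passed to WithTimestamps.of wraps in a Joda-Time Instant.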
The window function and the computation are applied as follows:

PCollection<Output> output = tempCollection.apply(Window
                .<Temperature>into(FixedWindows.of(Duration.standardSeconds(30)))
                .triggering(AfterWatermark.pastEndOfWindow()
                        .withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(10))))
                .withAllowedLateness(Duration.standardDays(1))
                .accumulatingFiredPanes())
                .apply(new ComputeMax());
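
To make the effect of event-time windowing concrete, the following sketch computes which 30-second fixed window a given event timestamp falls into, assuming windows aligned to the epoch with no offset. It is an illustration in plain java.time only; FixedWindowSketch and windowStart are hypothetical names, and Beam's FixedWindows does the equivalent internally with Joda-Time types.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of fixed-window assignment; not Beam code.
final class FixedWindowSketch {
    // Start of the epoch-aligned fixed window that contains eventTime.
    static Instant windowStart(Instant eventTime, Duration size) {
        long ts = eventTime.toEpochMilli();
        long sizeMillis = size.toMillis();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
    }

    public static void main(String[] args) {
        Duration size = Duration.ofSeconds(30);
        Instant event = Instant.parse("2019-01-16T11:15:45.560Z");
        Instant start = windowStart(event, size);
        // The window covers [start, start + size).
        System.out.println(start + " .. " + start.plus(size));
    }
}
```

With event-time windowing, an element stamped 11:15:45.560 belongs to the window starting at 11:15:30 no matter when it is processed, which is why the element timestamp has to be the "ts" value and not the arrival time.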
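Because in real scenarios the event timestamp is usually earlier than the processing timestamp, I feed data into the input stream with a delay of 5 seconds.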

I get the following error:

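Cannot output with timestamp 2019-01-16T11:15:45.560Z. Output timestamps must be no earlier than the timestamp of the current input (2019-01-16T11:16:50.640Z) minus the allowed skew (0 milliseconds). See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing the allowed skew.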
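
The rule quoted in the message can be modeled in a few lines. The sketch below is a hypothetical reconstruction of the check from the wording of the error, not Beam's actual code, and it uses java.time instead of Joda-Time. It plugs in the two timestamps from the message: the input element already carries 11:16:50.640, and the "ts" value being assigned is 65.08 seconds earlier.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical model of the rule quoted in the error message; not Beam's actual code.
final class TimestampSkewSketch {
    // An output timestamp is accepted if it is no earlier than (input - allowedSkew).
    static boolean isAllowed(Instant input, Instant output, Duration allowedSkew) {
        return !output.isBefore(input.minus(allowedSkew));
    }

    public static void main(String[] args) {
        Instant input = Instant.parse("2019-01-16T11:16:50.640Z");  // timestamp already on the element
        Instant output = Instant.parse("2019-01-16T11:15:45.560Z"); // event time taken from "ts"
        System.out.println(isAllowed(input, output, Duration.ZERO));          // default skew of 0 ms
        System.out.println(isAllowed(input, output, Duration.ofSeconds(66))); // skew larger than the gap
    }
}
```

Under this model, any attempt to move an element's timestamp backwards fails with a zero skew, which matches the behavior described here: the elements already have a later timestamp by the time the AssignTimeStamps step runs.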

If I comment out the AssignTimeStamps line there is no error, but I assume the pipeline is then working on processing time.

How can I make sure that my computation and windowing are based on event time rather than processing time?


Please provide some guidance on how to handle this scenario.

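Have you had a chance to try this with a timestamp policy? Sorry, I have not tried it myself, but I believe that in version 2.9.0 you should use such a policy when reading from KafkaIO.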


To be able to use a custom timestamp, you first need to implement a custom timestamp policy by extending TimestampPolicy.

For example:

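import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

public class CustomFieldTimePolicy extends TimestampPolicy<String, Foo> {

    protected Instant currentWatermark;

    public CustomFieldTimePolicy(Optional<Instant> previousWatermark) {
        // Resume from the checkpointed watermark if there is one.
        currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
    }

    @Override
    public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, Foo> record) {
        // Use the event time carried in the record value as the element timestamp.
        currentWatermark = new Instant(record.getKV().getValue().getTimestamp());
        return currentWatermark;
    }

    @Override
    public Instant getWatermark(PartitionContext ctx) {
        return currentWatermark;
    }
}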
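Then pass the custom policy to the KafkaIO read through withTimestampPolicyFactory:

.withTimestampPolicyFactory((tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))

This line is responsible for creating a new TimestampPolicy, passing in the related partition and the previously checkpointed watermark.


Is it necessary to use this CustomFieldTimePolicy, or can we assign the timestamp with a map step after reading? What is the difference?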