Streaming Apache Beam: error when assigning event time using WithTimestamps
I have an unbounded Kafka stream sending data with the following fields
{"identifier": "xxx", "value": 10.0, "ts":"2019-01-16T10:51:26.326242+0000"}
I read the stream using the Apache Beam SDK for Kafka
import org.apache.beam.sdk.io.kafka.KafkaIO;
pipeline.apply(KafkaIO.<Long, String>read()
.withBootstrapServers("kafka:9092")
.withTopic("test")
.withKeyDeserializer(LongDeserializer.class)
.withValueDeserializer(StringDeserializer.class)
.updateConsumerProperties(ImmutableMap.of("enable.auto.commit", "true"))
.updateConsumerProperties(ImmutableMap.of("group.id", "Consumer1"))
.commitOffsetsInFinalize()
.withoutMetadata());
Because I want to window by event time ("ts" in my example), I parse the incoming string and assign the "ts" field of each incoming record as its timestamp.
PCollection<Temperature> tempCollection = p.apply(new SetupKafka())
.apply(ParDo.of(new ReadFromTopic()))
.apply("ParseTemperature", ParDo.of(new ParseTemperature()));
tempCollection.apply("AssignTimeStamps", WithTimestamps.of(us -> new Instant(us.getTimestamp())));
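The question does not show ParseTemperature or what getTimestamp() returns, so here is only a minimal sketch of one way the "ts" string from the sample record could be turned into epoch milliseconds. The class EventTimeParser and the method toEpochMillis are hypothetical names, and the pattern assumes every record uses six fractional digits and an offset without a colon, as in the sample.

```java
import java.time.Instant;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of the question's code.
class EventTimeParser {
    // Matches values such as 2019-01-16T10:51:26.326242+0000
    // (six fractional digits, offset written without a colon).
    private static final DateTimeFormatter TS_FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSSSSZ");

    // Returns epoch milliseconds; the microsecond part is truncated.
    static long toEpochMillis(String ts) {
        return OffsetDateTime.parse(ts, TS_FORMAT).toInstant().toEpochMilli();
    }

    public static void main(String[] args) {
        long millis = toEpochMillis("2019-01-16T10:51:26.326242+0000");
        System.out.println(Instant.ofEpochMilli(millis));
    }
}
```

A long produced this way can be passed to the Joda-Time Instant constructor used in the AssignTimeStamps step.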
The window function and computation are applied as follows:
PCollection<Output> output = tempCollection.apply(Window
.<Temperature>into(FixedWindows.of(Duration.standardSeconds(30)))
.triggering(AfterWatermark.pastEndOfWindow()
.withLateFirings(AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(10))))
.withAllowedLateness(Duration.standardDays(1))
.accumulatingFiredPanes())
.apply(new ComputeMax());
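To make the effect of the 30-second fixed windows concrete, the sketch below computes which window an event timestamp would fall into, assuming windows are aligned to the epoch with no offset. It illustrates the arithmetic only and is not Beam's implementation; FixedWindowSketch and windowStart are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of fixed-window assignment by event time.
class FixedWindowSketch {
    // Start of the epoch-aligned fixed window that contains eventTime.
    static Instant windowStart(Instant eventTime, Duration size) {
        long sizeMs = size.toMillis();
        long ts = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMs));
    }

    public static void main(String[] args) {
        Duration size = Duration.ofSeconds(30);
        Instant eventTime = Instant.parse("2019-01-16T11:15:45.560Z");
        Instant start = windowStart(eventTime, size);
        // The window covers [start, start + size).
        System.out.println("[" + start + ", " + start.plus(size) + ")");
    }
}
```

The window is chosen from the element's timestamp, so the result only reflects event time if the element carries the "ts" value rather than the time it was read.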
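Since in real scenarios the event timestamp is usually earlier than the processing timestamp, I stream the data into the input stream with a delay of 5 seconds.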
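I get the following error:
Cannot output with timestamp 2019-01-16T11:15:45.560Z. Output
timestamps must be no earlier than the timestamp of the current input
(2019-01-16T11:16:50.640Z) minus the allowed skew (0 milliseconds).
See the DoFn#getAllowedTimestampSkew() Javadoc for details on changing
the allowed skew.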
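The rule stated in the error message can be sketched with the two timestamps it reports. This is an illustration of the check as the message describes it, not Beam's code; SkewCheckSketch and isAllowed are hypothetical names. The input timestamp here is presumably the time the record was read from Kafka, which I believe is what KafkaIO assigns by default, while the output timestamp is the earlier "ts" value.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical illustration of the rule quoted in the error message.
class SkewCheckSketch {
    // An output timestamp is accepted only if it is not earlier than (input - allowedSkew).
    static boolean isAllowed(Instant input, Instant output, Duration allowedSkew) {
        return !output.isBefore(input.minus(allowedSkew));
    }

    public static void main(String[] args) {
        Instant input = Instant.parse("2019-01-16T11:16:50.640Z");   // current input timestamp
        Instant output = Instant.parse("2019-01-16T11:15:45.560Z");  // timestamp being assigned
        System.out.println("output is behind input by ms: "
                + Duration.between(output, input).toMillis());
        System.out.println("allowed with zero skew: "
                + isAllowed(input, output, Duration.ZERO));
    }
}
```

With an allowed skew of zero, any timestamp moved backwards is rejected, which is why assigning an older event time after the read step fails.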
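If I comment out the AssignTimeStamps line, there is no error, but I guess the pipeline is then using processing time.
How do I make sure my computation and windows are based on event time and not processing time?
Please provide some input on how to handle this scenario.
Did you get a chance to try this with a timestamp policy? Sorry, I have not tried it myself, but I believe that with 2.9.0 you should use the policy while reading from KafkaIO.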
To be able to use a custom timestamp, you first need to implement a custom timestamp policy by extending TimestampPolicy.
For example:
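import java.util.Optional;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.io.kafka.TimestampPolicy;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.joda.time.Instant;

// Foo stands for the record's value type; it must expose getTimestamp().
public class CustomFieldTimePolicy extends TimestampPolicy<String, Foo> {
    protected Instant currentWatermark;
    public CustomFieldTimePolicy(Optional<Instant> previousWatermark) {
        // Resume from the checkpointed watermark, if there is one.
        currentWatermark = previousWatermark.orElse(BoundedWindow.TIMESTAMP_MIN_VALUE);
    }
    @Override
    public Instant getTimestampForRecord(PartitionContext ctx, KafkaRecord<String, Foo> record) {
        // Use the event time carried in the record and move the watermark to it.
        currentWatermark = new Instant(record.getKV().getValue().getTimestamp());
        return currentWatermark;
    }
    @Override
    public Instant getWatermark(PartitionContext ctx) {
        return currentWatermark;
    }
}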
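Then pass the policy when setting up KafkaIO:
.withTimestampPolicyFactory((tp, previousWatermark) -> new CustomFieldTimePolicy(previousWatermark))
This line is responsible for creating a new TimestampPolicy, passing in the relevant partition and the previously checkpointed watermark.
Is it necessary to use this CustomFieldTimePolicy, or can we assign the timestamps with a map after reading? What is the difference?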