Google Cloud Dataflow: trigger only once when the pane is complete


Is there a way to set up a trigger that fires only once, when a pane is complete? By "complete" I mean that the watermark has passed the end of the window plus any allowed lateness. I don't want any intermediate firings before that.

My current attempt to "fake" this behavior is to set

.withAllowedLateness(Duration.standardHours(1), ClosingBehavior.FIRE_ALWAYS))

and then filter the panes with a check like

if (c.pane().isLast()) { ...

Or, more concretely, roughly the following:

Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).withValidation().create());
p.apply(PubsubIO.Read.named("ReadFromPubsub").timestampLabel("myts").subscription(INPUT_TOPIC))
.apply("Window", Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(5)))
    .accumulatingFiredPanes()
    .withAllowedLateness(Duration.standardHours(1), ClosingBehavior.FIRE_ALWAYS))
.apply("Combine", Combine.<String, Metric>perKey(Foo.Merge))
.apply(ParDo.named("FilterComplete").of(Foo.FilterComplete));
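The FilterComplete step is not shown in the question; a minimal sketch of what such a filter might look like (hypothetical, in the Dataflow SDK 1.x DoFn style used above, with a placeholder Metric value type) is:

```java
// Hypothetical filter DoFn: pass through only the final pane of each window,
// i.e. the one produced by ClosingBehavior.FIRE_ALWAYS when the window closes
// after allowed lateness. Any earlier (on-time or late) firings are dropped.
static class FilterComplete extends DoFn<KV<String, Metric>, KV<String, Metric>> {
  @Override
  public void processElement(ProcessContext c) {
    if (c.pane().isLast()) {
      c.output(c.element());
    }
  }
}
```

This is exactly the wasteful part the question complains about: every window still fires intermediate panes through the Combine step, only to have them discarded here.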
While this approach seems to work, filtering out all the unused firings feels like a waste of resources. More importantly, after the streaming job has been running for several days it starts throwing

java.lang.IllegalStateException: Garbage collection hold...

exceptions, so I'm looking to rethink the approach.

The full exception is:

java.lang.IllegalStateException: Garbage collection hold 2017-07-16T14:55:43.999Z cannot be before input watermark 2017-07-16T15:34:15.000Z
at com.google.cloud.dataflow.worker.repackaged.com.google.common.base.Preconditions.checkState(Preconditions.java:199)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold.addGarbageCollectionHold(DataflowWatermarkHold.java:402)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold.addEndOfWindowOrGarbageCollectionHolds(DataflowWatermarkHold.java:279)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold.access$000(DataflowWatermarkHold.java:55)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold$1.read(DataflowWatermarkHold.java:534)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWatermarkHold$1.read(DataflowWatermarkHold.java:486)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowReduceFnRunner.onTrigger(DataflowReduceFnRunner.java:971)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowReduceFnRunner.emit(DataflowReduceFnRunner.java:902)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowReduceFnRunner.onTimers(DataflowReduceFnRunner.java:765)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowGABWViaWindowSetFn.processElement(DataflowGABWViaWindowSetFn.java:89)
at com.google.cloud.dataflow.sdk.util.SimpleDoFnRunner.invokeProcessElement(SimpleDoFnRunner.java:49)
at com.google.cloud.dataflow.sdk.util.DoFnRunnerBase.processElement(DoFnRunnerBase.java:139)
at com.google.cloud.dataflow.sdk.util.LateDataDroppingDoFnRunner.processElement(LateDataDroppingDoFnRunner.java:67)
at com.google.cloud.dataflow.sdk.runners.worker.SimpleParDoFn.processElement(SimpleParDoFn.java:188)
at com.google.cloud.dataflow.sdk.runners.worker.ForwardingParDoFn.processElement(ForwardingParDoFn.java:42)
at com.google.cloud.dataflow.sdk.runners.worker.DataflowWorkerLoggingParDoFn.processElement(DataflowWorkerLoggingParDoFn.java:47)
at com.google.cloud.dataflow.sdk.util.common.worker.ParDoOperation.process(ParDoOperation.java:55)
at com.google.cloud.dataflow.sdk.util.common.worker.OutputReceiver.process(OutputReceiver.java:52)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.runReadLoop(ReadOperation.java:221)
at com.google.cloud.dataflow.sdk.util.common.worker.ReadOperation.start(ReadOperation.java:182)
at com.google.cloud.dataflow.sdk.util.common.worker.MapTaskExecutor.execute(MapTaskExecutor.java:69)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.process(StreamingDataflowWorker.java:719)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker.access$600(StreamingDataflowWorker.java:95)
at com.google.cloud.dataflow.sdk.runners.worker.StreamingDataflowWorker$8.run(StreamingDataflowWorker.java:801)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Can you share the entire exception? Also, can you elaborate on why you want to wait past the watermark? If you're doing this because the watermark estimate advances too quickly (and thus produces too much late data), you could consider modifying the source so the watermark estimate is essentially an hour slower. Then you wouldn't need any allowed lateness; you could simply fire when the watermark passes the end of the window.

I updated the question with the full exception. The reason I want to wait is that I need to support late data (log files don't always arrive on time). Downstream processing is much easier if I get a single record guaranteed to contain everything, including late data, rather than discrete or accumulated fragments of individual records that require some kind of upsert logic to handle. That's why I need to keep a window open until I can be sure all log data has arrived.

@BenChambers Just following up on the trigger question: is there a trigger that fires when the watermark passes the end of the window plus the allowed lateness? Thanks for your help.

Not currently. In general, you shouldn't use triggers to reduce the amount of late data. If the watermark estimate produces too much late data, you should adjust it so that late data is less likely by the time the watermark passes the end of the window.

How did you end up handling this?
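The suggestion in the comments, sketched as a windowing configuration: if the source's watermark estimate is held back far enough that late data is unlikely, the default end-of-window trigger fires exactly once per window and no allowed lateness (hence no extra panes to filter) is needed. This is a sketch under that assumption, not a drop-in fix:

```java
// Sketch only: assumes the source's watermark already lags behind event time
// by roughly the lateness you need to absorb. The AfterWatermark trigger then
// fires once per window when the watermark passes the end of the window, and
// with zero allowed lateness there is exactly one pane to emit downstream.
.apply("Window", Window.<String>into(Sessions.withGapDuration(Duration.standardMinutes(5)))
    .triggering(AfterWatermark.pastEndOfWindow())
    .withAllowedLateness(Duration.ZERO)
    .discardingFiredPanes())
```

The trade-off is that any data arriving after the (slowed) watermark passes the window end is dropped rather than merged, which is why the commenter frames this as tuning the watermark estimate rather than the trigger.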