Google Cloud Dataflow: "Inputs to Flatten had incompatible window windowFns" when using CoGroupByKey with CalendarWindows

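TL;DR:

How can I CoGroupByKey a group of PCollections that share the same windowing strategy when that strategy is set with CalendarWindows?

Long version

I'm writing a Dataflow pipeline that reads from two different Pub/Sub sources. One of the PCollections is split into a PCollectionTuple, and finally I try to CoGroupByKey the results before saving them to BigQuery.

During pipeline tests, the windowing strategy for my PCollections was: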

private static PCollection<KV<String, Long>> applyWindowsAndCount(final PCollection<KV<String, Long>> summary, final String OperationName){
    return summary
            .apply("Apply Windows " + OperationName, Window
                    .<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))) 
                    .discardingFiredPanes()
                    .withAllowedLateness(Duration.standardDays(1)))
            .apply("Count " + OperationName, Count.perKey());
}
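
After switching to the CalendarWindows strategy shown further down, pipeline construction fails with the IllegalStateException listed at the end. Reading the documentation, I found: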

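When using CoGroupByKey to group PCollections that have a windowing strategy applied, all of the PCollections you want to group must use the same windowing strategy and window sizing. For example, all the collections you're merging must use (hypothetically) identical 5-minute fixed windows or 4-minute sliding windows starting every 30 seconds.

If your pipeline attempts to use CoGroupByKey to merge PCollections with incompatible windows, Dataflow will generate an IllegalStateException error when your pipeline is constructed.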


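Clearly, Dataflow thinks my PCollections have incompatible windows. However, all of them are windowed by the same function, applyWindowsAndCount. So how can I CoGroupByKey a group of PCollections that share the same windowing strategy set with CalendarWindows?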

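This looks like a bug in CalendarWindows. To work around it, create a single CalendarWindows object and use it as the windowFn for every PCollection, instead of creating a separate CalendarWindows object for each PCollection.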
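
The effect of the workaround can be illustrated without the Dataflow SDK. The sketch below is a plain-Java model, not SDK code: `DayWindowFn`, `checkCompatible`, and `passes` are hypothetical names, and it assumes that the compatibility check relies on an `equals()` that the affected window function class does not override, so only the very same object counts as compatible. Under that assumption, identically configured but separately created instances are rejected, while one shared instance is accepted.

```java
import java.util.Arrays;
import java.util.List;

final class SharedWindowFnSketch {

    /** Hypothetical stand-in for a window function that does NOT override equals(). */
    static final class DayWindowFn {
        final int days;
        final String timeZoneId;

        DayWindowFn(int days, String timeZoneId) {
            this.days = days;
            this.timeZoneId = timeZoneId;
        }
    }

    /** Assumed check: every input must carry a window function equal to the first one. */
    static void checkCompatible(List<DayWindowFn> inputs) {
        DayWindowFn first = inputs.get(0);
        for (DayWindowFn fn : inputs) {
            // Object.equals() is inherited here, so this is an identity comparison.
            if (!first.equals(fn)) {
                throw new IllegalStateException(
                        "Inputs had incompatible window windowFns: " + first + ", " + fn);
            }
        }
    }

    /** Returns true if the check accepts the inputs, false if it throws. */
    static boolean passes(List<DayWindowFn> inputs) {
        try {
            checkCompatible(inputs);
            return true;
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // One new instance per input, all configured identically: rejected.
        List<DayWindowFn> separate = Arrays.asList(
                new DayWindowFn(1, "UTC"), new DayWindowFn(1, "UTC"), new DayWindowFn(1, "UTC"));

        // A single instance reused for every input: accepted.
        DayWindowFn shared = new DayWindowFn(1, "UTC");
        List<DayWindowFn> reused = Arrays.asList(shared, shared, shared);

        System.out.println("separate instances pass: " + passes(separate)); // false
        System.out.println("shared instance passes:  " + passes(reused));   // true
    }
}
```

In the pipeline from the question, the equivalent change would be to build the `CalendarWindows.days(1).withTimeZone(DateTimeZone.UTC).withStartingDay(2016,9,20)` value once, keep it in a field, and pass that same reference to `Window.into(...)` on every call to `applyWindowsAndCount`.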

The merge step from the question, which applies CoGroupByKey to the three counted PCollections:

private static PCollection<KV<String, CoGbkResult>> MergeSummary(PCollection<KV<String, Long>> Avail, PCollection<KV<String, Long>> ValuationOK, PCollection<KV<String, Long>> ValuationKO){
    return KeyedPCollectionTuple.of(Util.AVAIL, Avail)
                                .and(Util.VALUATION_OK, ValuationOK)
                                .and(Util.VALUATION_KO, ValuationKO)
                                .apply("Merge Summary", CoGroupByKey.create());
}

The CalendarWindows version of the windowing function, which triggers the error:

private static PCollection<KV<String, Long>> applyWindowsAndCount(final PCollection<KV<String, Long>> summary, final String OperationName){
    return summary
            .apply("Apply Windows " + OperationName, Window
                    .<KV<String, Long>>into(CalendarWindows.days(1).withTimeZone(DateTimeZone.UTC).withStartingDay(2016,9,20)) // Per-day windowing.
                    .discardingFiredPanes()
                    .withAllowedLateness(Duration.standardDays(1))) // Accepts data up to 1 day late.
            .apply("Count " + OperationName, Count.perKey());
}

The exception raised while the pipeline is being constructed:

Exception in thread "main" java.lang.IllegalStateException: Inputs to Flatten had incompatible window windowFns: com.google.cloud.dataflow.sdk.transforms.windowing.CalendarWindows$DaysWindows@6af9fcb2, com.google.cloud.dataflow.sdk.transforms.windowing.CalendarWindows$DaysWindows@6cce16f4
at com.google.cloud.dataflow.sdk.transforms.Flatten$FlattenPCollectionList.apply(Flatten.java:121)
at com.google.cloud.dataflow.sdk.transforms.Flatten$FlattenPCollectionList.apply(Flatten.java:105)
at com.google.cloud.dataflow.sdk.runners.PipelineRunner.apply(PipelineRunner.java:74)
at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.apply(DataflowPipelineRunner.java:413)
at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:367)
at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:274)
at com.google.cloud.dataflow.sdk.values.PCollectionList.apply(PCollectionList.java:175)
at com.google.cloud.dataflow.sdk.transforms.join.CoGroupByKey.apply(CoGroupByKey.java:124)
at com.google.cloud.dataflow.sdk.transforms.join.CoGroupByKey.apply(CoGroupByKey.java:74)
at com.google.cloud.dataflow.sdk.runners.PipelineRunner.apply(PipelineRunner.java:74)
at com.google.cloud.dataflow.sdk.runners.DataflowPipelineRunner.apply(DataflowPipelineRunner.java:413)
at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:367)
at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:290)
at com.google.cloud.dataflow.sdk.transforms.join.KeyedPCollectionTuple.apply(KeyedPCollectionTuple.java:116)
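
A note on reading the exception message: the two `CalendarWindows$DaysWindows@6af9fcb2` and `CalendarWindows$DaysWindows@6cce16f4` strings have the shape that Java's default `Object.toString()` produces (class name, `@`, hash code in hex), which suggests two distinct window function instances were being compared. The small sketch below reproduces that shape with a hypothetical nested class; `IdentityToStringSketch`, its `DaysWindows` class, and `describe` are illustration-only names and are not part of the Dataflow SDK.

```java
final class IdentityToStringSketch {

    /** Hypothetical nested class that overrides neither toString() nor equals(). */
    static final class DaysWindows { }

    /** Rebuilds the default Object.toString() text: class name, '@', hex hash code. */
    static String describe(Object o) {
        return o.getClass().getName() + "@" + Integer.toHexString(o.hashCode());
    }

    public static void main(String[] args) {
        DaysWindows a = new DaysWindows();
        DaysWindows b = new DaysWindows();

        // Each line has the form IdentityToStringSketch$DaysWindows@<hex>; the hex part varies per run.
        System.out.println(a);
        System.out.println(b);

        // The default toString() matches the manual reconstruction.
        System.out.println("matches describe(): " + a.toString().equals(describe(a)));

        // Without an equals() override, two separately created objects are never equal.
        System.out.println("a.equals(b): " + a.equals(b));
    }
}
```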