Google Cloud Dataflow: how do I apply a transform to all elements in a window of an unbounded Apache Beam pipeline before the window is output?
I'm writing a Dataflow pipeline that reads from Google Pub/Sub and writes the data to Google Cloud Storage:
pipeline.apply(marketData)
.apply(ParDo.of(new PubsubMessageToByteArray()))
.apply(ParDo.of(new ByteArrayToString()))
.apply(ParDo.of(new StringToMarketData()))
.apply(ParDo.of(new AddTimestamps()))
.apply(Window.<MarketData>into(FixedWindows.of(Duration.standardMinutes(options.getMinutesPerWindow())))
.withAllowedLateness(Duration.standardSeconds(options.getAllowedSecondLateness()))
.accumulatingFiredPanes())
.apply(ParDo.of(new MarketDataToCsv()))
.apply("Write File(s)", TextIO
.write()
.to(options.getOutputDirectory())
.withWindowedWrites()
.withNumShards(1)
.withFilenamePolicy(new WindowedFilenamePolicy(outputBaseDirectory))
.withHeader(csvHeader));
pipeline.run().waitUntilFinish();
Before outputting the results, I want to de-duplicate and sort the elements within each window. This differs from a typical PTransform in that I want the transform to execute only once the window closes.
The Pub/Sub topic will contain duplicates, because multiple workers generate the same message when one worker fails. How do I remove all duplicates within a window before writing? I see that a class for this existed in Beam version 0.2, but not in the current version.
I understand that under the hood, Beam parallelizes the ParDo transforms across workers. But since this pipeline writes with numShards(1), only one worker writes the final result. That means it should, in theory, be possible to have that worker apply a de-duplication transform before writing.
The Beam Python SDK still has this, so I could reproduce the logic in Java, but why would it be removed unless there's a better way? I'd imagine the implementation would be a de-duplicating ParDo that executes after some window trigger fires.
EDIT: It looks like these will do what I need; I'm trying them now. Here's the answer for the de-duplication part:
.apply(Distinct
// MarketData::key produces a String. Use withRepresentativeValue()
// because Apache Beam serializes Java objects into bytes, which
// could cause two equal objects to be interpreted as not equal. See
// org/apache/beam/sdk/transforms/Distinct.java for details.
.withRepresentativeValueFn(MarketData::key)
.withRepresentativeType(TypeDescriptor.of(String.class)))
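For intuition, Distinct.withRepresentativeValueFn keeps one element per representative key rather than comparing serialized bytes. A minimal stdlib-only sketch of that semantics, outside Beam (the key-prefixed string messages are a hypothetical stand-in for MarketData and its key() method):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class RepresentativeDedup {
    // Keep the first element seen for each representative key,
    // mirroring what Distinct.withRepresentativeValueFn does per window.
    static <T, K> List<T> dedupByKey(List<T> elements, Function<T, K> keyFn) {
        Map<K, T> byKey = new LinkedHashMap<>();
        for (T e : elements) {
            byKey.putIfAbsent(keyFn.apply(e), e);
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        // Duplicate messages share the same key, as they would after a
        // Pub/Sub redelivery.
        List<String> messages = List.of("a:1", "b:2", "a:1", "c:3", "b:2");
        List<String> unique = dedupByKey(messages, m -> m.split(":")[0]);
        System.out.println(unique); // [a:1, b:2, c:3]
    }
}
```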
And here's a solution for sorting and de-duplicating the elements (in case you also need sorting):
public static class DedupAndSortByTime extends
    Combine.CombineFn<MarketData, TreeSet<MarketData>, List<MarketData>> {

  @Override
  public TreeSet<MarketData> createAccumulator() {
    return new TreeSet<>(Comparator
        .comparingLong(MarketData::getEventTime)
        .thenComparing(MarketData::getOrderbookType));
  }

  @Override
  public TreeSet<MarketData> addInput(TreeSet<MarketData> accum, MarketData input) {
    accum.add(input);
    return accum;
  }

  @Override
  public TreeSet<MarketData> mergeAccumulators(Iterable<TreeSet<MarketData>> accums) {
    TreeSet<MarketData> merged = createAccumulator();
    for (TreeSet<MarketData> accum : accums) {
      merged.addAll(accum);
    }
    return merged;
  }

  @Override
  public List<MarketData> extractOutput(TreeSet<MarketData> accum) {
    return Lists.newArrayList(accum.iterator());
  }
}
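The TreeSet accumulator does the heavy lifting: its comparator both orders elements and drops any element that compares equal to one already present. A small stdlib-only sketch of that behavior (the long/string pairs stand in for MarketData's getEventTime and getOrderbookType):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeSet;

public class TreeSetDedup {
    public static void main(String[] args) {
        // Mirrors the CombineFn's accumulator: order by event time, then
        // by a secondary field; entries that compare equal are treated as
        // duplicates and silently dropped by TreeSet.add().
        TreeSet<Map.Entry<Long, String>> accum = new TreeSet<>(
            Comparator.<Map.Entry<Long, String>>comparingLong(Map.Entry::getKey)
                      .thenComparing(Map.Entry::getValue));

        accum.add(new SimpleEntry<>(20L, "BID"));
        accum.add(new SimpleEntry<>(10L, "ASK"));
        accum.add(new SimpleEntry<>(20L, "BID")); // duplicate, ignored

        System.out.println(accum.size());          // 2
        System.out.println(accum.first().getKey()); // 10
    }
}
```

Note that TreeSet de-duplicates by the comparator, not by equals(), so the comparator must cover every field that distinguishes two elements you want to keep.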
So the updated pipeline is:
// Pipeline
pipeline.apply(marketData)
.apply(ParDo.of(new MarketDataDoFns.PubsubMessageToByteArray()))
.apply(ParDo.of(new MarketDataDoFns.ByteArrayToString()))
.apply(ParDo.of(new MarketDataDoFns.StringToMarketDataAggregate()))
.apply(ParDo.of(new MarketDataDoFns.DenormalizeMarketDataAggregate()))
.apply(ParDo.of(new MarketDataDoFns.AddTimestamps()))
.apply(Window.<MarketData>into(FixedWindows.of(Duration.standardMinutes(options.getMinutesPerWindow())))
.withAllowedLateness(Duration.standardSeconds(options.getAllowedSecondLateness()))
.accumulatingFiredPanes())
.apply(Combine.globally(new MarketDataCombineFn.DedupAndSortByTime()).withoutDefaults())
.apply(ParDo.of(new MarketDataDoFns.MarketDataToCsv()))
.apply("Write File(s)", TextIO
.write()
// This doesn't set the output directory as expected.
// "/output" gets stripped and I don't know why,
// so "/output" has to be added to the directory path
// within the FilenamePolicy.
.to(options.getOutputDirectory())
.withWindowedWrites()
.withNumShards(1)
.withFilenamePolicy(new MarketDataFilenamePolicy.WindowedFilenamePolicy(outputBaseDirectory))
.withHeader(csvHeader));
pipeline.run().waitUntilFinish();