How do I apply a transform to all elements in a window of an unbounded Apache Beam pipeline before the window is output? (Google Cloud Dataflow)

Tags: google-cloud-dataflow, apache-beam

I'm writing a Dataflow pipeline that reads from Google Pub/Sub and writes the data to Google Cloud Storage:

    pipeline.apply(marketData)
        .apply(ParDo.of(new PubsubMessageToByteArray()))
        .apply(ParDo.of(new ByteArrayToString()))
        .apply(ParDo.of(new StringToMarketData()))
        .apply(ParDo.of(new AddTimestamps()))
        .apply(Window.<MarketData>into(FixedWindows.of(Duration.standardMinutes(options.getMinutesPerWindow())))
                .withAllowedLateness(Duration.standardSeconds(options.getAllowedSecondLateness()))
                .accumulatingFiredPanes())
        .apply(ParDo.of(new MarketDataToCsv()))
        .apply("Write File(s)", TextIO
                .write()
                .to(options.getOutputDirectory())
                .withWindowedWrites()
                .withNumShards(1)
                .withFilenamePolicy(new WindowedFilenamePolicy(outputBaseDirectory))
                .withHeader(csvHeader));

    pipeline.run().waitUntilFinish();
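
For context, the AddTimestamps step isn't shown in the question. A minimal sketch of such a DoFn follows, assuming MarketData exposes an epoch-millisecond getEventTime() (an assumption here); it assigns each element's event time so that FixedWindows buckets elements by event time rather than Pub/Sub arrival time:

    // Sketch only; uses org.joda.time.Instant and org.apache.beam.sdk.transforms.DoFn.
    static class AddTimestamps extends DoFn<MarketData, MarketData> {
        @ProcessElement
        public void processElement(ProcessContext c) {
            MarketData data = c.element();
            // Moving timestamps backwards beyond the default skew requires
            // overriding getAllowedTimestampSkew().
            c.outputWithTimestamp(data, new Instant(data.getEventTime()));
        }
    }
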
Before outputting results, I'd like to deduplicate and sort the elements within each window. This differs from a typical PTransform in that I want the transform to run once the window closes.

The Pub/Sub topic will contain duplicates, because multiple workers can produce the same messages when one worker fails. How do I remove all duplicates within a window before writing? I see that a class for this existed in Beam version 0.2, but it doesn't exist in the current version.

I understand that under the hood, Beam parallelizes PTransforms across workers. But because this pipeline writes with withNumShards(1), only one worker will write the final result. That means it should, in theory, be possible to have that worker apply a deduplication transform before writing.

The Beam Python SDK still has this, so I could reproduce that logic in Java, but why would it have been removed unless there's a better way? I imagine the implementation would be a deduplicating ParDo that runs after some window trigger fires.
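
For illustration only, such a per-window deduplicating ParDo could be built on Beam's per-key-and-window state. The sketch below assumes the stream has been keyed by a String id beforehand (via a hypothetical MarketData::key), and it deduplicates incrementally as elements arrive rather than after the trigger fires:

    // Sketch only; StateSpec, StateSpecs, and ValueState live in org.apache.beam.sdk.state.
    static class DedupPerWindow extends DoFn<KV<String, MarketData>, MarketData> {
        // Per-key, per-window flag; state is cleared automatically when the
        // window expires, so each key is emitted at most once per window.
        @StateId("seen")
        private final StateSpec<ValueState<Boolean>> seenSpec =
                StateSpecs.value(BooleanCoder.of());

        @ProcessElement
        public void processElement(ProcessContext c,
                @StateId("seen") ValueState<Boolean> seen) {
            if (seen.read() == null) {
                seen.write(true);
                c.output(c.element().getValue());
            }
        }
    }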


Edit: It looks like these will do what I need. I'm trying them out now.

Here's the answer for the deduplication part:

.apply(Distinct
    // MarketData::key produces a String. Use withRepresentativeValueFn()
    // because Apache Beam serializes Java objects into bytes, which can
    // cause two logically equal objects to be treated as not equal. See
    // org/apache/beam/sdk/transforms/Distinct.java for details.
    .withRepresentativeValueFn(MarketData::key)
    .withRepresentativeType(TypeDescriptor.of(String.class)))
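
The key method itself isn't shown; for reference, a hypothetical version that concatenates the identifying fields (using the getters that appear in the CombineFn below) could look like:

    // Hypothetical sketch; the real MarketData.key() is not shown in the question.
    public String key() {
        return getEventTime() + ":" + getOrderbookType();
    }
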
And here's a solution for deduplicating and sorting the elements (in case you also need sorting):

public static class DedupAndSortByTime extends
        Combine.CombineFn<MarketData, TreeSet<MarketData>, List<MarketData>> {

    @Override
    public TreeSet<MarketData> createAccumulator() {
        return new TreeSet<>(Comparator
                .comparingLong(MarketData::getEventTime)
                .thenComparing(MarketData::getOrderbookType));
    }

    @Override
    public TreeSet<MarketData> addInput(TreeSet<MarketData> accum, MarketData input) {
        accum.add(input);
        return accum;
    }

    @Override
    public TreeSet<MarketData> mergeAccumulators(Iterable<TreeSet<MarketData>> accums) {
        TreeSet<MarketData> merged = createAccumulator();
        for (TreeSet<MarketData> accum : accums) {
            merged.addAll(accum);
        }
        return merged;
    }

    @Override
    public List<MarketData> extractOutput(TreeSet<MarketData> accum) {
        // Lists here is Guava's com.google.common.collect.Lists.
        return Lists.newArrayList(accum.iterator());
    }
}
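
A note on the design: because the input PCollection here is in non-global fixed windows, Combine.globally(...) must be used with .withoutDefaults(), as in the pipeline below. Beam can only produce a default value for empty windows under the global window, so omitting .withoutDefaults() makes pipeline construction fail.
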
So the updated pipeline is:

    // Pipeline
    pipeline.apply(marketData)
        .apply(ParDo.of(new MarketDataDoFns.PubsubMessageToByteArray()))
        .apply(ParDo.of(new MarketDataDoFns.ByteArrayToString()))
        .apply(ParDo.of(new MarketDataDoFns.StringToMarketDataAggregate()))
        .apply(ParDo.of(new MarketDataDoFns.DenormalizeMarketDataAggregate()))
        .apply(ParDo.of(new MarketDataDoFns.AddTimestamps()))
        .apply(Window.<MarketData>into(FixedWindows.of(Duration.standardMinutes(options.getMinutesPerWindow())))
                .withAllowedLateness(Duration.standardSeconds(options.getAllowedSecondLateness()))
                .accumulatingFiredPanes())
        .apply(Combine.globally(new MarketDataCombineFn.DedupAndSortByTime()).withoutDefaults())
        .apply(ParDo.of(new MarketDataDoFns.MarketDataToCsv()))
        .apply("Write File(s)", TextIO
                .write()
                // This doesn't set the output directory as expected. 
                // "/output" gets stripped and I don't know why,
                // so "/output" has to be added to the directory path 
                // within the FilenamePolicy.
                .to(options.getOutputDirectory())
                .withWindowedWrites()
                .withNumShards(1)
                .withFilenamePolicy(new MarketDataFilenamePolicy.WindowedFilenamePolicy(outputBaseDirectory))
                .withHeader(csvHeader));

    pipeline.run().waitUntilFinish();
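
A final note on withNumShards(1): TextIO requires windowed writes and an explicit shard count to write an unbounded PCollection, and a single shard also keeps each window's output in one CSV file, at the cost of write parallelism.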