Google Cloud Dataflow: how do I apply a transform to all elements in a window of an unbounded Apache Beam pipeline before the window is output?
I'm writing a Dataflow pipeline that reads from Google Pub/Sub and writes the data to Google Cloud Storage:
pipeline.apply(marketData)
.apply(ParDo.of(new PubsubMessageToByteArray()))
.apply(ParDo.of(new ByteArrayToString()))
.apply(ParDo.of(new StringToMarketData()))
.apply(ParDo.of(new AddTimestamps()))
.apply(Window.<MarketData>into(FixedWindows.of(Duration.standardMinutes(options.getMinutesPerWindow())))
.withAllowedLateness(Duration.standardSeconds(options.getAllowedSecondLateness()))
.accumulatingFiredPanes())
.apply(ParDo.of(new MarketDataToCsv()))
.apply("Write File(s)", TextIO
.write()
.to(options.getOutputDirectory())
.withWindowedWrites()
.withNumShards(1)
.withFilenamePolicy(new WindowedFilenamePolicy(outputBaseDirectory))
.withHeader(csvHeader));
pipeline.run().waitUntilFinish();
Before outputting the results, I want to de-duplicate and sort the elements within each window. This differs from a typical PTransform in that I want the transform to execute only once the window closes.
The Pub/Sub topic will contain duplicates, because multiple workers generate the same message when one worker fails. How do I remove all duplicates within a window before writing? I see that a class for this existed in Beam version 0.2, but not in the current version.
I understand that under the hood, Beam parallelizes the ParDo transforms across workers. But since this pipeline writes with numShards(1), only one worker writes the final result. That means it should, in theory, be possible to have that worker apply a de-duplication transform before writing.
The Beam Python SDK still has this, so I could reproduce the logic in Java, but why would it be removed unless there's a better way? I'd imagine the implementation would be a de-duplicating ParDo that executes after some window trigger fires.
EDIT: It looks like these will do what I need; I'm trying them now. Here's the answer for the de-duplication part:
.apply(Distinct
// MarketData::key produces a String. Use withRepresentativeValue()
// because Apache Beam serializes Java objects into bytes, which
// could cause two equal objects to be interpreted as not equal. See
// org/apache/beam/sdk/transforms/Distinct.java for details.
.withRepresentativeValueFn(MarketData::key)
.withRepresentativeType(TypeDescriptor.of(String.class)))
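For intuition, Distinct.withRepresentativeValueFn keeps one element per representative key rather than comparing serialized bytes. A minimal stdlib-only sketch of that semantics, outside Beam (the key-prefixed string messages are a hypothetical stand-in for MarketData and its key() method):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class RepresentativeDedup {
    // Keep the first element seen for each representative key,
    // mirroring what Distinct.withRepresentativeValueFn does per window.
    static <T, K> List<T> dedupByKey(List<T> elements, Function<T, K> keyFn) {
        Map<K, T> byKey = new LinkedHashMap<>();
        for (T e : elements) {
            byKey.putIfAbsent(keyFn.apply(e), e);
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        // Duplicate messages share the same key, as they would after a
        // Pub/Sub redelivery.
        List<String> messages = List.of("a:1", "b:2", "a:1", "c:3", "b:2");
        List<String> unique = dedupByKey(messages, m -> m.split(":")[0]);
        System.out.println(unique); // [a:1, b:2, c:3]
    }
}
```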
And here's a solution for sorting and de-duplicating the elements (in case you also need sorting):
public static class DedupAndSortByTime extends
    Combine.CombineFn<MarketData, TreeSet<MarketData>, List<MarketData>> {

  @Override
  public TreeSet<MarketData> createAccumulator() {
    return new TreeSet<>(Comparator
        .comparingLong(MarketData::getEventTime)
        .thenComparing(MarketData::getOrderbookType));
  }

  @Override
  public TreeSet<MarketData> addInput(TreeSet<MarketData> accum, MarketData input) {
    accum.add(input);
    return accum;
  }

  @Override
  public TreeSet<MarketData> mergeAccumulators(Iterable<TreeSet<MarketData>> accums) {
    TreeSet<MarketData> merged = createAccumulator();
    for (TreeSet<MarketData> accum : accums) {
      merged.addAll(accum);
    }
    return merged;
  }

  @Override
  public List<MarketData> extractOutput(TreeSet<MarketData> accum) {
    return Lists.newArrayList(accum.iterator());
  }
}
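The TreeSet accumulator does the heavy lifting: its comparator both orders elements and drops any element that compares equal to one already present. A small stdlib-only sketch of that behavior (the long/string pairs stand in for MarketData's getEventTime and getOrderbookType):

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Comparator;
import java.util.Map;
import java.util.TreeSet;

public class TreeSetDedup {
    public static void main(String[] args) {
        // Mirrors the CombineFn's accumulator: order by event time, then
        // by a secondary field; entries that compare equal are treated as
        // duplicates and silently dropped by TreeSet.add().
        TreeSet<Map.Entry<Long, String>> accum = new TreeSet<>(
            Comparator.<Map.Entry<Long, String>>comparingLong(Map.Entry::getKey)
                      .thenComparing(Map.Entry::getValue));

        accum.add(new SimpleEntry<>(20L, "BID"));
        accum.add(new SimpleEntry<>(10L, "ASK"));
        accum.add(new SimpleEntry<>(20L, "BID")); // duplicate, ignored

        System.out.println(accum.size());          // 2
        System.out.println(accum.first().getKey()); // 10
    }
}
```

Note that TreeSet de-duplicates by the comparator, not by equals(), so the comparator must cover every field that distinguishes two elements you want to keep.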
So the updated pipeline is:
// Pipeline
pipeline.apply(marketData)
.apply(ParDo.of(new MarketDataDoFns.PubsubMessageToByteArray()))
.apply(ParDo.of(new MarketDataDoFns.ByteArrayToString()))
.apply(ParDo.of(new MarketDataDoFns.StringToMarketDataAggregate()))
.apply(ParDo.of(new MarketDataDoFns.DenormalizeMarketDataAggregate()))
.apply(ParDo.of(new MarketDataDoFns.AddTimestamps()))
.apply(Window.<MarketData>into(FixedWindows.of(Duration.standardMinutes(options.getMinutesPerWindow())))
.withAllowedLateness(Duration.standardSeconds(options.getAllowedSecondLateness()))
.accumulatingFiredPanes())
.apply(Combine.globally(new MarketDataCombineFn.DedupAndSortByTime()).withoutDefaults())
.apply(ParDo.of(new MarketDataDoFns.MarketDataToCsv()))
.apply("Write File(s)", TextIO
.write()
// This doesn't set the output directory as expected.
// "/output" gets stripped and I don't know why,
// so "/output" has to be added to the directory path
// within the FilenamePolicy.
.to(options.getOutputDirectory())
.withWindowedWrites()
.withNumShards(1)
.withFilenamePolicy(new MarketDataFilenamePolicy.WindowedFilenamePolicy(outputBaseDirectory))
.withHeader(csvHeader));
pipeline.run().waitUntilFinish();