Windowing data hourly (clock-aligned) in Apache Beam (Java)
I am trying to aggregate streaming data by the hour in a Dataflow/Apache Beam job (for example 12:00 to 12:59 and 01:00 to 01:59). Here is my use case: the data comes from Pub/Sub and carries a timestamp (the order date). For every hour I want the number of orders received, and I also want to allow 5 hours of lateness. Below is the sample code I am using:
LOG.info("Start Running Pipeline");
DataflowPipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().as(DataflowPipelineOptions.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<String> directShipmentFeedData = pipeline.apply("Get Direct Shipment Feed Data", PubsubIO.readStrings().fromSubscription(directShipmentFeedSubscription));
PCollection<String> tibcoRetailOrderConfirmationFeedData = pipeline.apply("Get Tibco Retail Order Confirmation Feed Data", PubsubIO.readStrings().fromSubscription(tibcoRetailOrderConfirmationFeedSubscription));
PCollection<String> flattenData = PCollectionList.of(directShipmentFeedData).and(tibcoRetailOrderConfirmationFeedData)
        .apply("Flatten Data from PubSub", Flatten.<String>pCollections());
flattenData
        .apply(ParDo.of(new DataParse())).setCoder(SerializableCoder.of(SalesAndUnits.class))
        // Adding Window
        .apply(
                Window.<SalesAndUnits>into(
                        SlidingWindows.of(Duration.standardMinutes(15))
                                .every(Duration.standardMinutes(1)))
        )
        // Data Enrich with Dimensions
        .apply(ParDo.of(new DataEnrichWithDimentions()))
        // Group And Hourly Sum
        .apply(new GroupAndSumSales())
        .apply(ParDo.of(new SQLWrite())).setCoder(SerializableCoder.of(SalesAndUnits.class));
pipeline.run();
LOG.info("Finish Running Pipeline");
I would use a window that matches your requirements. Something like:
.apply(
        Window.<SalesAndUnits>into(
                FixedWindows.of(Duration.standardHours(1))
        ).withAllowedLateness(Duration.standardHours(5)))
possibly followed by a count, since I understand that is what you need.
Hope that helps.
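To make the idea of clock-aligned hourly windows concrete, here is a minimal plain-Java sketch of the arithmetic behind a one-hour fixed window with five hours of allowed lateness. It does not use Beam at all: the class `HourlyWindowSketch` and its methods `windowStart`, `isAccepted` and `countPerWindow` are hypothetical names made up for this illustration, and the lateness check is a simplified approximation of what a runner does with its watermark.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: mimics hour-aligned fixed windows without any Beam dependency.
class HourlyWindowSketch {
    static final Duration WINDOW_SIZE = Duration.ofHours(1);
    static final Duration ALLOWED_LATENESS = Duration.ofHours(5);

    // Start of the epoch-aligned window that contains eventTime,
    // e.g. 12:34:56 maps to 12:00:00 (UTC).
    static Instant windowStart(Instant eventTime) {
        long sizeMillis = WINDOW_SIZE.toMillis();
        long ts = eventTime.toEpochMilli();
        return Instant.ofEpochMilli(ts - Math.floorMod(ts, sizeMillis));
    }

    // Simplified lateness rule (an assumption of this sketch): an element is
    // still accepted while the watermark is before window end + allowed lateness.
    static boolean isAccepted(Instant eventTime, Instant watermark) {
        Instant cutoff = windowStart(eventTime).plus(WINDOW_SIZE).plus(ALLOWED_LATENESS);
        return watermark.isBefore(cutoff);
    }

    // Number of orders per hourly window, keyed by window start.
    static Map<Instant, Long> countPerWindow(List<Instant> orderTimes) {
        Map<Instant, Long> counts = new TreeMap<>();
        for (Instant t : orderTimes) {
            counts.merge(windowStart(t), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Instant> orders = Arrays.asList(
                Instant.parse("2024-01-01T12:00:00Z"),
                Instant.parse("2024-01-01T12:59:59Z"),
                Instant.parse("2024-01-01T13:00:00Z"));
        // The first two orders fall into the 12:00 window, the third into 13:00.
        System.out.println(countPerWindow(orders));
    }
}
```

Unlike the 15-minute sliding window in the question, each order lands in exactly one window here, so an hourly count is not duplicated across overlapping windows. In Beam itself, as far as I know, a non-zero allowed lateness also requires choosing an accumulation mode (`.accumulatingFiredPanes()` or `.discardingFiredPanes()`) on the `Window` transform.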