Google Cloud Dataflow: Apache Beam streaming pipeline with sequential batch processing

What I want to do:

  • Consume JSON messages from a Pub/Sub subscription using an Apache Beam streaming pipeline with the Dataflow runner

  • Unmarshal the payload strings into objects

    • Assume "messageId" is the unique id of an incoming message, e.g. msgid1, msgid2, etc.

  • Fetch from the database the child records of each object produced by step 2. The same child can be applicable to multiple messages

    • Assume "childId" is the unique id of a child record, e.g. cid1234, cid1235, etc.

  • Group the child records by their unique id, as in the example below and in the sketch after this list

    • KV of (cid1234, [msgid1, msgid2]) and KV of (cid1235, [msgid1, msgid2])

  • Write the grouped result at the childId level to the database
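To make the grouping concrete, here is a minimal sketch of the data shapes involved. The element types are illustrative assumptions, not taken from the original code:

    // Hypothetical output of ChildsReadFn: one pair per (childId, messageId)
    // association found in the database.
    PCollection<KV<String, String>> childToMsg = ...;  // e.g. KV.of("cid1234", "msgid1")

    // GroupByKey then collects, per window, every messageId seen for each child.
    PCollection<KV<String, Iterable<String>>> grouped =
        childToMsg.apply(GroupByKey.create());         // ("cid1234", [msgid1, msgid2])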

  • Questions:

  • Where should the windowing be introduced? We currently have a 30-minute fixed window after step 1

  • How does Beam define the start and end times of the 30-minute window? Is it relative to when we start the pipeline, or to the first message of the batch?

  • What if steps 2 through 5 take longer than one hour for one window and the next window's batch is ready? Will the two window batches be processed in parallel?

  • How can we make the next window's messages wait until the previous window's batch is fully processed?

    • If we don't, the next batch will overwrite the childId-level results
  • Code snippet:

    PCollection<PubsubMessage> messages = pipeline.apply("ReadPubSubSubscription",
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription("projects/project1/subscriptions/subscription1"));

    PCollection<PubsubMessage> windowedMessages = messages.apply(Window.into(FixedWindows
        .of(Duration.standardMinutes(30))));

    // Unmarshal the JSON payloads; failures go to a dead-letter output.
    PCollectionTuple unmarshalResultTuple = windowedMessages.apply("UnmarshalJsonStrings",
        ParDo.of(new JsonUnmarshallFn())
            .withOutputTags(JsonUnmarshallFn.mainOutputTag,
                TupleTagList.of(JsonUnmarshallFn.deadLetterTag)));

    // Fetch the child records for each unmarshalled object from the database.
    PCollectionTuple childRecordsTuple = unmarshalResultTuple
        .get(JsonUnmarshallFn.mainOutputTag)
        .apply("FetchChildsFromDBAndProcess",
            ParDo.of(new ChildsReadFn())
                .withOutputTags(ChildsReadFn.mainOutputTag,
                    TupleTagList.of(ChildsReadFn.deadLetterTag)));

    // input is KV of (childId, msgids), output is mutations to write to BT
    PCollectionTuple postProcessTuple = childRecordsTuple
        .get(ChildsReadFn.mainOutputTag)
        .apply(GroupByKey.create())
        .apply("UpdateChildAssociations",
            ParDo.of(new ChildsProcessorFn())
                .withOutputTags(ChildsProcessorFn.mutations,
                    TupleTagList.of(ChildsProcessorFn.deadLetterTag)));

    postProcessTuple.get(ChildsProcessorFn.mutations)
        .apply("WriteToBigtable", CloudBigtableIO.writeToTable(...));
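The last line above is shorthand. Below is a minimal sketch of how the Bigtable write could be wired up with the bigtable-hbase-beam client, assuming ChildsProcessorFn.mutations carries HBase Mutation objects; the project, instance, and table ids are placeholders:

    CloudBigtableTableConfiguration btConfig = new CloudBigtableTableConfiguration.Builder()
        .withProjectId("project1")    // placeholder
        .withInstanceId("instance1")  // placeholder
        .withTableId("table1")        // placeholder
        .build();

    postProcessTuple.get(ChildsProcessorFn.mutations)
        .apply("WriteToBigtable", CloudBigtableIO.writeToTable(btConfig));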
    
Addressing each of your questions:

Regarding questions 1 and 2 about setting windows in Apache Beam, you need to understand that windows exist independently of the job: windows are aligned starting from the Unix epoch (timestamp = 0). In other words, your data is assigned to whichever fixed time range it falls into; for example, with fixed 60-second windows:

    PCollection<String> items = ...;
    PCollection<String> fixedWindowedItems = items.apply(
        Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))));
    
First window: [0s, 60s); second window: [60s, 120s); and so on. See the documentation for more details.
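Because fixed windows are aligned to the epoch, the window an element lands in can be derived from its timestamp alone; the boundaries do not depend on when the pipeline starts or when the first message arrives. A minimal sketch of the underlying arithmetic (plain Java, not a Beam API call):

    // For a fixed window of size S aligned to the epoch, an element with
    // event timestamp t (millis) falls into [t - (t % S), t - (t % S) + S).
    long sizeMillis = Duration.standardMinutes(30).getMillis();
    long t = 1_640_000_000_000L;               // example event timestamp
    long windowStart = t - (t % sizeMillis);   // start of the enclosing window
    long windowEnd = windowStart + sizeMillis; // exclusive end of the window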

Regarding question 3, the default behavior of windowing and triggering in Apache Beam is to ignore late data, although the handling of late data is configurable. To do so you must understand the concept of watermarks: a watermark is a measure of how far behind event time the data is. For example, you could allow 3 seconds of lateness; data that arrives up to 3 seconds late is still assigned to its correct window. For data that arrives past that bound, you define what happens to it: you can reprocess it or ignore it.

Allowed lateness

    PCollection<String> items = ...;
    PCollection<String> fixedWindowedItems = items.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .withAllowedLateness(Duration.standardDays(2)));
    
Note that this sets how long after the end of the window late-arriving data is still accepted.

Triggering

    PCollection<String> pc = ...;
    pc.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardMinutes(30)));
    
Note that when late data arrives for a window that has already fired, the window is reprocessed and the result recomputed against event time. This trigger gives you an opportunity to react to late data.
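Whether a late firing re-emits the full recomputed window or only the newly arrived elements is controlled by the accumulation mode, which must be chosen whenever you set a trigger. A minimal sketch of the two options; the trigger shown (an on-time firing at the watermark plus one firing per late element) is one reasonable choice, not the only one:

    PCollection<String> input = ...;

    // Accumulating: every firing emits the full contents of the pane so far,
    // so a late firing re-emits the complete, updated result for the window.
    input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(30)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .accumulatingFiredPanes());

    // Discarding: every firing emits only the elements that arrived since
    // the previous firing for that window.
    input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(30)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .discardingFiredPanes());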


Finally, regarding question 4, this is partly explained by the concepts above: computation takes place within each fixed window and is recomputed/reprocessed each time a trigger fires. This logic guarantees that your data lands in the correct window.
