Google Cloud Dataflow: Apache Beam streaming pipeline with sequential batch processing

What I want to do:

  • Consume JSON messages from a Pub/Sub subscription using an Apache Beam streaming pipeline with the Dataflow runner

  • Unmarshal the payload strings into objects

    • Assume "messageId" is the unique id of an incoming message, e.g. msgid1, msgid2, etc.

  • Fetch from the database the child records of each object produced by step 2. The same child can be applicable to multiple messages

    • Assume "childId" is the unique id of a child record, e.g. cid1234, cid1235, etc.

  • Group the child records by their unique id, as in the example below and in the sketch after this list

    • KV of (cid1234, [msgid1, msgid2]) and KV of (cid1235, [msgid1, msgid2])

  • Write the grouped result at the childId level to the database
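To make the grouping concrete, here is a minimal sketch of the data shapes involved. The element types are illustrative assumptions, not taken from the original code:

    // Hypothetical output of ChildsReadFn: one pair per (childId, messageId)
    // association found in the database.
    PCollection<KV<String, String>> childToMsg = ...;  // e.g. KV.of("cid1234", "msgid1")

    // GroupByKey then collects, per window, every messageId seen for each child.
    PCollection<KV<String, Iterable<String>>> grouped =
        childToMsg.apply(GroupByKey.create());         // ("cid1234", [msgid1, msgid2])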

  • Questions:

  • Where should the windowing be introduced? We currently have a 30-minute fixed window after step 1

  • How does Beam define the start and end times of the 30-minute window? Is it relative to when we start the pipeline, or to the first message of the batch?

  • What if steps 2 through 5 take longer than one hour for one window and the next window's batch is ready? Will the two window batches be processed in parallel?

  • How can we make the next window's messages wait until the previous window's batch is fully processed?

    • If we don't, the next batch will overwrite the childId-level results
  • Code snippet:

    PCollection<PubsubMessage> messages = pipeline.apply("ReadPubSubSubscription",
        PubsubIO.readMessagesWithAttributes()
            .fromSubscription("projects/project1/subscriptions/subscription1"));

    PCollection<PubsubMessage> windowedMessages = messages.apply(Window.into(FixedWindows
        .of(Duration.standardMinutes(30))));

    // Unmarshal the JSON payloads; failures go to a dead-letter output.
    PCollectionTuple unmarshalResultTuple = windowedMessages.apply("UnmarshalJsonStrings",
        ParDo.of(new JsonUnmarshallFn())
            .withOutputTags(JsonUnmarshallFn.mainOutputTag,
                TupleTagList.of(JsonUnmarshallFn.deadLetterTag)));

    // Fetch the child records for each unmarshalled object from the database.
    PCollectionTuple childRecordsTuple = unmarshalResultTuple
        .get(JsonUnmarshallFn.mainOutputTag)
        .apply("FetchChildsFromDBAndProcess",
            ParDo.of(new ChildsReadFn())
                .withOutputTags(ChildsReadFn.mainOutputTag,
                    TupleTagList.of(ChildsReadFn.deadLetterTag)));

    // input is KV of (childId, msgids), output is mutations to write to BT
    PCollectionTuple postProcessTuple = childRecordsTuple
        .get(ChildsReadFn.mainOutputTag)
        .apply(GroupByKey.create())
        .apply("UpdateChildAssociations",
            ParDo.of(new ChildsProcessorFn())
                .withOutputTags(ChildsProcessorFn.mutations,
                    TupleTagList.of(ChildsProcessorFn.deadLetterTag)));

    postProcessTuple.get(ChildsProcessorFn.mutations)
        .apply("WriteToBigtable", CloudBigtableIO.writeToTable(...));
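The last line above is shorthand. Below is a minimal sketch of how the Bigtable write could be wired up with the bigtable-hbase-beam client, assuming ChildsProcessorFn.mutations carries HBase Mutation objects; the project, instance, and table ids are placeholders:

    CloudBigtableTableConfiguration btConfig = new CloudBigtableTableConfiguration.Builder()
        .withProjectId("project1")    // placeholder
        .withInstanceId("instance1")  // placeholder
        .withTableId("table1")        // placeholder
        .build();

    postProcessTuple.get(ChildsProcessorFn.mutations)
        .apply("WriteToBigtable", CloudBigtableIO.writeToTable(btConfig));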
    
Addressing each of your questions:

Regarding questions 1 and 2 about setting windows in Apache Beam, you need to understand that windows exist independently of the job: windows are aligned starting from the Unix epoch (timestamp = 0). In other words, your data is assigned to whichever fixed time range it falls into; for example, with fixed 60-second windows:

    PCollection<String> items = ...;
    PCollection<String> fixedWindowedItems = items.apply(
        Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))));
    
First window: [0s, 60s); second window: [60s, 120s); and so on. See the documentation for more details.
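Because fixed windows are aligned to the epoch, the window an element lands in can be derived from its timestamp alone; the boundaries do not depend on when the pipeline starts or when the first message arrives. A minimal sketch of the underlying arithmetic (plain Java, not a Beam API call):

    // For a fixed window of size S aligned to the epoch, an element with
    // event timestamp t (millis) falls into [t - (t % S), t - (t % S) + S).
    long sizeMillis = Duration.standardMinutes(30).getMillis();
    long t = 1_640_000_000_000L;               // example event timestamp
    long windowStart = t - (t % sizeMillis);   // start of the enclosing window
    long windowEnd = windowStart + sizeMillis; // exclusive end of the window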

Regarding question 3, the default behavior of windowing and triggering in Apache Beam is to ignore late data, although the handling of late data is configurable. To do so you must understand the concept of watermarks: a watermark is a measure of how far behind event time the data is. For example, you could allow 3 seconds of lateness; data that arrives up to 3 seconds late is still assigned to its correct window. For data that arrives past that bound, you define what happens to it: you can reprocess it or ignore it.

Allowed lateness

    PCollection<String> items = ...;
    PCollection<String> fixedWindowedItems = items.apply(
        Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
            .withAllowedLateness(Duration.standardDays(2)));
    
Note that this sets how long after the end of the window late-arriving data is still accepted.

Triggering

    PCollection<String> pc = ...;
    pc.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .triggering(AfterProcessingTime.pastFirstElementInPane()
            .plusDelayOf(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardMinutes(30)));
    
Note that when late data arrives for a window that has already fired, the window is reprocessed and the result recomputed against event time. This trigger gives you an opportunity to react to late data.
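Whether a late firing re-emits the full recomputed window or only the newly arrived elements is controlled by the accumulation mode, which must be chosen whenever you set a trigger. A minimal sketch of the two options; the trigger shown (an on-time firing at the watermark plus one firing per late element) is one reasonable choice, not the only one:

    PCollection<String> input = ...;

    // Accumulating: every firing emits the full contents of the pane so far,
    // so a late firing re-emits the complete, updated result for the window.
    input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(30)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .accumulatingFiredPanes());

    // Discarding: every firing emits only the elements that arrived since
    // the previous firing for that window.
    input.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(30)))
        .triggering(AfterWatermark.pastEndOfWindow()
            .withLateFirings(AfterPane.elementCountAtLeast(1)))
        .withAllowedLateness(Duration.standardMinutes(30))
        .discardingFiredPanes());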


Finally, regarding question 4, this is partly explained by the concepts above: computation takes place within each fixed window and is recomputed/reprocessed each time a trigger fires. This logic guarantees that your data lands in the correct window.
