Google Cloud Dataflow: Apache Beam streaming pipeline with continuous batching
What I want to do:
PCollection<PubsubMessage> messages = pipeline.apply("ReadPubSubSubscription",
    PubsubIO.readMessagesWithAttributes()
        .fromSubscription("projects/project1/subscriptions/subscription1"));

PCollection<PubsubMessage> windowedMessages = messages.apply(Window.into(FixedWindows
    .of(Duration.standardMinutes(30))));

PCollectionTuple unmarshalResultTuple = windowedMessages.apply("UnmarshalJsonStrings",
    ParDo.of(new JsonUnmarshallFn())
        .withOutputTags(JsonUnmarshallFn.mainOutputTag,
            TupleTagList.of(JsonUnmarshallFn.deadLetterTag)));

PCollectionTuple childRecordsTuple = unmarshalResultTuple
    .get(JsonUnmarshallFn.mainOutputTag)
    .apply("FetchChildsFromDBAndProcess",
        ParDo.of(new ChildsReadFn())
            .withOutputTags(ChildsReadFn.mainOutputTag,
                TupleTagList.of(ChildsReadFn.deadLetterTag)));

// input is KV of (childId, msgIds), output is mutations to write to BT
PCollectionTuple postProcessTuple = childRecordsTuple
    .get(ChildsReadFn.mainOutputTag)
    .apply(GroupByKey.create())
    .apply("UpdateChildAssociations",
        ParDo.of(new ChildsProcessorFn())
            .withOutputTags(ChildsProcessorFn.mutations,
                TupleTagList.of(ChildsProcessorFn.deadLetterTag)));

postProcessTuple.get(ChildsProcessorFn.mutations)
    .apply(CloudBigtableIO.writeToTable(...));
- Assume "messageId" is the unique ID of an incoming message, e.g. msgid1, msgid2, etc.
- Assume "childId" is the unique ID of a child record, e.g. cid1234, cid1235, etc.
- The output is KV.of(cid1234, Map.of(msgid1, msgid2)) and KV.of(cid1235, Map.of(msgid1, msgid2)) (see the sketch after this list).
- If we don't do this, the next batch will overwrite the results at the childId level.
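
For illustration only, here is a minimal sketch of what ChildsReadFn's main output could look like under one reading of the above, emitting one KV pair per (childId, messageId); ParsedMessage and lookupChildIds() are hypothetical names, not from the original post:

// Hypothetical sketch of ChildsReadFn: for each unmarshalled message, look up
// the child records it touches and emit KV.of(childId, messageId), so that the
// downstream GroupByKey collects all message IDs per child.
static class ChildsReadFn extends DoFn<ParsedMessage, KV<String, String>> {
    static final TupleTag<KV<String, String>> mainOutputTag =
        new TupleTag<KV<String, String>>() {};
    static final TupleTag<ParsedMessage> deadLetterTag =
        new TupleTag<ParsedMessage>() {};

    @ProcessElement
    public void process(@Element ParsedMessage msg, MultiOutputReceiver out) {
        try {
            for (String childId : lookupChildIds(msg)) { // hypothetical DB lookup
                out.get(mainOutputTag).output(KV.of(childId, msg.getMessageId()));
            }
        } catch (Exception e) {
            out.get(deadLetterTag).output(msg); // send failures to the dead-letter output
        }
    }
}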
Addressing each of your questions:
Regarding questions 1 and 2 about setting up windows in Apache Beam, you need to understand that "windows exist before the job". By this I mean that windows start from the Unix epoch (timestamp = 0). In other words, your data is assigned into fixed time ranges regardless of when the job starts; for example, with fixed 60-second windows:
PCollection<String> items = ...;
PCollection<String> fixedWindowedItems = items.apply(
Window.<String>into(FixedWindows.of(Duration.standardSeconds(60))));
First window: [0s, 60s); second window: [60s, 120s); and so on.
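
To make the epoch alignment concrete, here is a small arithmetic sketch (plain Java, not Beam API) of how an element's timestamp maps to its fixed window:

// Fixed windows are aligned to the Unix epoch: an element falls into the window
// whose start is the largest multiple of the window size <= its timestamp.
long windowSizeMillis = 60_000L;  // 60-second windows
long timestampMillis = 125_000L;  // example event timestamp
long windowStart = timestampMillis - (timestampMillis % windowSizeMillis);
long windowEnd = windowStart + windowSizeMillis;
// windowStart == 120000, windowEnd == 180000, i.e. the window [120s, 180s)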
See the documentation for more details.
Regarding question 3, the default windowing and triggering behavior in Apache Beam is to ignore late data, although handling of late data can be configured. To do this, you must understand the concept of watermarks. A watermark is a measure of how far behind the data is. For example, with a 3-second watermark, data that arrives up to 3 seconds late is still assigned to the correct window. On the other hand, if data arrives after the watermark has passed, you can define what happens to it: you can reprocess it or ignore it.
Allowed lateness
PCollection<String> items = ...;
PCollection<String> fixedWindowedItems = items.apply(
    Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
        .withAllowedLateness(Duration.standardDays(2)));
Note that this sets how late data is allowed to arrive after the end of the window.
Triggering
PCollection<String> pc = ...;
pc.apply(Window.<String>into(FixedWindows.of(1, TimeUnit.MINUTES))
    .triggering(AfterProcessingTime.pastFirstElementInPane()
        .plusDelayOf(Duration.standardMinutes(1)))
    .withAllowedLateness(Duration.standardMinutes(30)));
Note that when late data arrives, the window is reprocessed and the result recomputed. This trigger gives you the opportunity to react to late data.
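
Related to this, whether a re-fired pane contains everything seen so far or only the new (late) elements is controlled by the accumulation mode; a sketch, assuming the same PCollection<String> pc as above:

pc.apply(Window.<String>into(FixedWindows.of(Duration.standardMinutes(1)))
    .triggering(AfterWatermark.pastEndOfWindow()
        // fire again whenever at least one late element arrives
        .withLateFirings(AfterPane.elementCountAtLeast(1)))
    .withAllowedLateness(Duration.standardMinutes(30))
    // accumulatingFiredPanes(): each firing emits all elements seen so far;
    // discardingFiredPanes() would emit only elements since the last firing.
    .accumulatingFiredPanes());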
Finally, regarding question 4, it is partly explained by the concepts above: the computation takes place within each fixed window and is recomputed/reprocessed on every trigger firing. This logic ensures your data lands in the correct window.