Google Dataflow and Pub/Sub cannot achieve exactly-once delivery

google-cloud-platform google-cloud-dataflow apache-beam google-cloud-pubsub

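I'm trying to achieve exactly-once delivery using Google Dataflow and PubSub with Apache Beam SDK 2.6.0.

The use case is quite simple:

The 'Generator' Dataflow job sends 1M messages to a PubSub topic: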

GenerateSequence
          .from(0)
          .to(1000000)
          .withRate(100000, Duration.standardSeconds(1L));

The 'Archive' Dataflow job reads messages from a PubSub subscription and saves them to Google Cloud Storage:

pipeline
        .apply("Read events",
            PubsubIO.readMessagesWithAttributes()
                // this is to achieve exactly-once delivery
                .withIdAttribute(ATTRIBUTE_ID)
                .fromSubscription("subscription")
                .withTimestampAttribute(TIMESTAMP_ATTRIBUTE))
        .apply("Window events",
            Window.<Dto>into(FixedWindows.of(Duration.millis(options.getWindowDuration())))
                .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow()))
                .withAllowedLateness(Duration.standardMinutes(15))
                .discardingFiredPanes())
        .apply("Events count metric", ParDo.of(new CountMessagesMetric()))
        .apply("Write files to archive",
            FileIO.<String, Dto>writeDynamic()
                .by(Dto::getDataSource).withDestinationCoder(StringUtf8Coder.of())
                .via(Contextful.of((msg, ctx) -> msg.getData(), Requirements.empty()), TextIO.sink())
                .to(archiveDir)
                .withTempDirectory(archiveDir)
                .withNumShards(options.getNumShards())
                .withNaming(dataSource ->
                    new SyslogWindowedDataSourceFilenaming(dataSource, archiveDir, filenamePrefix, filenameSuffix)
                ));

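I added 'withIdAttribute' to both PubsubIO.Write (the 'Generator' job) and PubsubIO.Read (the 'Archive' job), expecting it to guarantee exactly-once semantics.

I would like to test the 'negative' scenario:

The 'Generator' Dataflow job sends 1M messages to a PubSub topic.

The 'Archive' Dataflow job starts working, but I stop it in the middle of processing by clicking 'Stop job' -> 'Drain'. Some portion of the messages has been processed and saved to Cloud Storage, let's say 400K messages.

I start the 'Archive' job again and expect it to pick up the unprocessed messages (600K), so that eventually I see exactly 1M messages saved to Storage.

What I actually get: all messages are delivered (at-least-once is achieved), but on top of that there are a lot of duplicates, roughly 30-50K per 1M messages.

Is there any solution to achieve exactly-once delivery?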

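So, I've never done this myself, but reasoning about your problem, this is how I would approach it.

My solution is a bit convoluted, but I could not find another way to achieve this without involving other external services. So, here goes nothing.

You could have your pipeline read from both pubsub and GCS and then combine the two to de-duplicate the data. The tricky part is that one is a bounded PCollection (GCS) and the other an unbounded one (pubsub). You can add timestamps to the bounded collection and then window the data. At this stage you could potentially drop GCS data older than about 15 minutes (the window duration in your previous implementation). These two steps (i.e. adding timestamps properly and dropping data that is probably old enough not to create duplicates) are by far the trickiest parts.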
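To make the idea a bit more concrete, here is a minimal plain-Java model of the "drop old archived data, then combine and de-duplicate" logic. It is not Beam code: ordinary lists stand in for the two PCollections, and the `Event` type, the `mergeAndDedup` helper and its `maxAge` parameter are all made up for illustration. In a real pipeline the same shape would presumably be built from Beam transforms such as `Flatten` and `GroupByKey`, so treat this only as a sketch of the intended behaviour.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class MergeAndDedupSketch {

    /** Hypothetical element type: message ID, event timestamp and payload. */
    static final class Event {
        final String id;
        final Instant timestamp;
        final String data;

        Event(String id, Instant timestamp, String data) {
            this.id = id;
            this.timestamp = timestamp;
            this.data = data;
        }
    }

    /**
     * Combines already-archived events with newly received ones and keeps
     * the first event seen for each ID. Archived events older than maxAge
     * are dropped first, on the assumption that they are too old to be
     * redelivered.
     */
    static List<Event> mergeAndDedup(List<Event> archived, List<Event> incoming,
                                     Instant now, Duration maxAge) {
        Instant cutoff = now.minus(maxAge);
        Map<String, Event> firstPerId = new LinkedHashMap<>();
        for (Event e : archived) {
            if (!e.timestamp.isBefore(cutoff)) {   // keep only recent archive data
                firstPerId.putIfAbsent(e.id, e);
            }
        }
        for (Event e : incoming) {
            firstPerId.putIfAbsent(e.id, e);       // a repeated ID is ignored
        }
        return new ArrayList<>(firstPerId.values());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2018-09-01T12:00:00Z");
        List<Event> archived = Arrays.asList(
            new Event("a", now.minus(Duration.ofMinutes(5)), "archived-a"),
            new Event("b", now.minus(Duration.ofMinutes(20)), "archived-b"));
        List<Event> incoming = Arrays.asList(
            new Event("a", now.minus(Duration.ofMinutes(1)), "redelivered-a"),
            new Event("c", now, "new-c"));

        for (Event e : mergeAndDedup(archived, incoming, now, Duration.ofMinutes(15))) {
            System.out.println(e.id + " " + e.data);
        }
    }
}
```

Note the trade-off this model makes visible: once an archived event falls outside the cutoff it no longer protects against a late redelivery of the same ID, which is why choosing the cutoff is one of the tricky parts.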

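Once this has been solved, append the two PCollections and then group them on an Id that is common to both sets of data. This will yield a single PCollection keyed by that Id.

Dataflow does not let you persist state across runs. If you use Java, you can update the running pipeline in a way that does not cause it to lose its existing state, which allows de-duplication across pipeline releases.
If this doesn't work for you, you may want to archive messages in a way that keys them by ATTRIBUTE_ID, for example in GCS, using the ID as the file name.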
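As a rough illustration of why keying the archive by ID helps, here is a small plain-Java sketch in which a local temporary directory stands in for a GCS bucket. The `archive` and `countArchived` helpers and the `.txt` naming are hypothetical, and the sketch assumes message IDs are safe to use as file names. Writing the same ID twice simply overwrites the same file, so a redelivered message does not produce a second copy.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class IdKeyedArchiveSketch {

    /** Writes the payload to a file named after the message ID; a repeated ID overwrites it. */
    static void archive(Path dir, String messageId, String data) throws IOException {
        Files.write(dir.resolve(messageId + ".txt"), data.getBytes(StandardCharsets.UTF_8));
    }

    /** Counts the files (i.e. distinct message IDs) currently in the archive directory. */
    static long countArchived(Path dir) throws IOException {
        try (Stream<Path> files = Files.list(dir)) {
            return files.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("archive");
        archive(dir, "msg-1", "first");
        archive(dir, "msg-2", "second");
        archive(dir, "msg-1", "first");   // redelivered duplicate, same file name
        System.out.println(countArchived(dir));   // prints 2
    }
}
```

The obvious cost is one object per message instead of a few sharded files per window, so this is only a sketch of the idempotent-write idea rather than a drop-in replacement for the FileIO step in the question.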
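Do you get duplicates even if you don't interrupt the 'Archive' job?
No, I don't; the 'happy path' works fine. If the 'Archive' job is not interrupted, I get exactly 1M messages in storage.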