Apache Kafka: Kafka producer in Apache Beam (tags: apache-kafka, google-cloud-dataflow, apache-beam, apache-beam-io)


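How to get the records for which an acknowledgment was received in Apache Beam KafkaIO

Basically, I want every record that did not receive an acknowledgment to go to a BigQuery table so that it can be retried later. I used the following code snippet from the documentation: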

    .apply(KafkaIO.<Long, String>read()
       .withBootstrapServers("broker_1:9092,broker_2:9092")
       .withTopic("my_topic")  // use withTopics(List<String>) to read from multiple topics.
       .withKeyDeserializer(LongDeserializer.class)
       .withValueDeserializer(StringDeserializer.class)

       // Above four are required configuration. returns PCollection<KafkaRecord<Long, String>>

       // Rest of the settings are optional :

       // you can further customize KafkaConsumer used to read the records by adding more
       // settings for ConsumerConfig. e.g :
       .updateConsumerProperties(ImmutableMap.of("group.id", "my_beam_app_1"))

       // set event times and watermark based on LogAppendTime. To provide a custom
       // policy see withTimestampPolicyFactory(). withProcessingTime() is the default.
       .withLogAppendTime()

       // restrict reader to committed messages on Kafka (see method documentation).
       .withReadCommitted()

       // offset consumed by the pipeline can be committed back.
       .commitOffsetsInFinalize()

       // finally, if you don't need Kafka metadata, you can drop it.
       .withoutMetadata() // PCollection<KV<Long, String>>
    )
    .apply(Values.<String>create()) // PCollection<String>


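By default, Beam IOs are designed to keep attempting to write/read/process elements until they succeed. (Batch pipelines will fail after repeated errors.)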

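What you are referring to is usually called a dead letter queue (DLQ): failed records are taken and added to a PCollection, a Pubsub topic, a queuing service, etc. This is often desirable because it lets a streaming pipeline make progress (rather than block) when an error is encountered writing some records, while still allowing the records that succeed to be written.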
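As a minimal sketch of the dead letter idea described above, the following plain Java example (no Beam or Kafka dependency) attempts to write each record and diverts the ones whose write fails into a separate list instead of stopping. The class `DeadLetterSketch`, the method `route`, and the `flakyWriter` are hypothetical names made up for illustration; they are not part of KafkaIO. In a Beam pipeline the same shape would usually be expressed as a `DoFn` with an additional tagged output, but that is an assumption about how you would wire it up, not something KafkaIO provides.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;

/** Dead-letter routing sketch in plain Java (hypothetical, not a KafkaIO API). */
final class DeadLetterSketch {

    /** Holds the two outputs: records that were written and records that failed. */
    static final class Result {
        final List<String> delivered = new ArrayList<>();
        final List<String> deadLetters = new ArrayList<>();
    }

    /**
     * Tries to write every record with the given writer. A record whose write
     * throws is diverted to the dead-letter list instead of stopping the loop,
     * so later records can still make progress.
     */
    static Result route(List<String> records, Consumer<String> writer) {
        Result result = new Result();
        for (String record : records) {
            try {
                writer.accept(record);
                result.delivered.add(record);
            } catch (RuntimeException e) {
                result.deadLetters.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical writer standing in for a Kafka send: it rejects
        // any record whose value starts with "bad".
        Consumer<String> flakyWriter = record -> {
            if (record.startsWith("bad")) {
                throw new IllegalStateException("no acknowledgment for " + record);
            }
        };
        Result result = route(Arrays.asList("a", "bad-1", "b"), flakyWriter);
        System.out.println("delivered=" + result.delivered);     // delivered=[a, b]
        System.out.println("deadLetters=" + result.deadLetters); // deadLetters=[bad-1]
    }
}
```

The dead-letter list here plays the role of the second PCollection: it is what you would later write to BigQuery for a retry.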

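Unfortunately, unless I am mistaken, KafkaIO does not implement a dead letter queue. It may be possible to modify KafkaIO to support this. There was some discussion on the Beam mailing list, and some ideas for implementing it were proposed there.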

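I suspect it would be possible to add this, catching the failed records and outputting them to another PCollection. If you choose to do this, please also get in touch with the Beam community if you would like help merging it into master. They will be able to help make sure the change covers the necessary requirements, so that it can be merged and makes sense for Beam as a whole.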

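Your pipeline can then write those records elsewhere (i.e. to a different destination). Of course, if that secondary destination has an outage or issue at the same time, you will need another DLQ.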
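To illustrate the last point, here is a small sketch in plain Java of draining dead-lettered records into a secondary destination while keeping track of the ones that the secondary destination also rejects. `SecondarySinkSketch` and `drain` are hypothetical names for illustration only, and the `Consumer<String>` stands in for whatever secondary writer you use (for example a BigQuery write), which is an assumption rather than anything taken from KafkaIO.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Sketch of writing dead letters to a secondary destination (hypothetical). */
final class SecondarySinkSketch {

    /**
     * Writes each dead-lettered record to the secondary destination.
     * Returns the records the secondary destination also failed to accept;
     * these are the ones that would need yet another DLQ.
     */
    static List<String> drain(List<String> deadLetters, Consumer<String> secondary) {
        List<String> stillFailing = new ArrayList<>();
        for (String record : deadLetters) {
            try {
                secondary.accept(record);
            } catch (RuntimeException e) {
                stillFailing.add(record);
            }
        }
        return stillFailing;
    }
}
```

If the returned list is non-empty, the secondary destination had a problem too, which is exactly the case where a further fallback is required.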

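I think you can take a look here:

Can you describe how data flows through your project? When does it start?

@muscat Thanks for the article, but my question is not answered in it either.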