Adding custom logic/callback/handler on EOF for JsonFraming in Akka Streams

scala apache-kafka akka

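I have a flow in which I consume file paths from a Kafka paths topic in small batches, read the files themselves (large JSON arrays), and then write their contents back to a Kafka data topic.

It looks like this: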

      val fileFlow = Flow[Path].flatMapConcat(HdfsSource.data(fs, _))
        .via(JsonFraming.objectScanner(Int.MaxValue))

      Consumer
        .committableSource(newConsumerSettings, Subscriptions.topics(inputTopicNames))
        .map(value => value)
        .grouped(kafkaConsumerBatch)
        .flatMapConcat(paths => Source(paths))
        .map(path => new Path(path.record.value().get("fullPath").asInstanceOf[String]))
        //Based on: https://github.com/akka/alpakka/blob/v3.0.0/doc-examples/src/test/scala/akka/stream/alpakka/eip/scaladsl/PassThroughExamples.scala#L72-L92
        .via(PassThroughFlow(fileFlow))
        .map { case (bytes, path) => (bytes, entityConfigMap(getCountryPrefix(path))) }
        .map(bytesAndPath => (bytesAndPath._1.utf8String.parseJson.asJsObject, bytesAndPath._2))
        .map { case (bytes, entityConfig) => (toGenericRecord(bytes, entityConfig), entityConfig) }
        .map { case (record, entityConfig) =>
          producerMessagesToTopic.mark()
          ProducerMessage.single(
            new ProducerRecord[NotUsed, GenericRecord](getDataTopicName(entityConfig), record),
            passThrough = entityConfig)
        }
        .via {
          akka.kafka.scaladsl.Producer.flexiFlow(prodSettings)
        }
....More logic for logging and running/materializing the flow

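The problem is that, as I said, these JSON files are large, so I can't take the entire file contents, frame them into separate objects, store them all to Kafka, and only then commit. That is what I ultimately need to do, but I also need to control the offset commits based on the EOF event.

I'd like the Producer to send data to Kafka at its own pace, whatever its configuration, but somehow inject my custom logic on the EOF event. Perhaps a passThrough field indicating that the file has been fully consumed, so that we can now commit the offset in the upstream paths topic.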

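objectScanner has a GraphStageLogic in its definition, and that logic has an onUpstreamFinish callback, but there is no direct access to it for overriding. Classes such as SimpleLinearGraphStage and JsonObjectParser are marked as internal API.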
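One way to surface an EOF event without touching the internal stages, sketched here under assumptions, is to frame each file in its own sub-source and append an explicit end-of-file marker with concat before flattening. The FileEvent, JsonObject and EndOfFile types, the framedWithEof helper and the in-memory file contents are all hypothetical names introduced for illustration; a real pipeline would use HdfsSource.data in place of the in-memory ByteString. The sketch assumes Akka 2.6+, where an implicit ActorSystem provides the materializer.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{JsonFraming, Sink, Source}
import akka.util.ByteString

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object EofMarkerSketch {
  // Hypothetical event types: either one framed JSON object or an end-of-file marker
  sealed trait FileEvent
  final case class JsonObject(file: String, json: String) extends FileEvent
  final case class EndOfFile(file: String) extends FileEvent

  // Frames one file's bytes into objects, then appends a marker once that file's source completes
  def framedWithEof(file: String, data: Source[ByteString, Any]): Source[FileEvent, Any] =
    data
      .via(JsonFraming.objectScanner(1024))
      .map[FileEvent](bytes => JsonObject(file, bytes.utf8String))
      .concat(Source.single[FileEvent](EndOfFile(file)))

  // Flattens all files into one stream; each file's objects are followed by its marker
  def events(files: List[(String, String)])(implicit system: ActorSystem): Future[Seq[FileEvent]] =
    Source(files)
      .flatMapConcat { case (file, contents) =>
        // in-memory stand-in (an assumption) for HdfsSource.data(fs, path)
        framedWithEof(file, Source.single(ByteString(contents)))
      }
      .runWith(Sink.seq)

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("eof-marker-sketch")
    val files = List("a.json" -> """[{"id":1},{"id":2}]""", "b.json" -> """[{"id":3}]""")
    try Await.result(events(files), 10.seconds).foreach(println)
    finally system.terminate()
  }
}
```

Note that a marker arriving at some stage only shows that the preceding objects have passed that stage. To use it as a commit signal, it would have to travel through the producer stage as well (for example as a pass-through element), so that it emerges only after the records before it have been handled.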

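"…I can't take the entire file contents, frame them into separate objects, store them all to Kafka, and only then commit"

Since the offset commit is effectively an acknowledgement that you've fully processed the file, there's no way (feel free to comment if I'm wrong about this) around not committing the offset until all the objects in the file named by the message at that offset have been produced to Kafka.

The downside of Source.via(Flow.flatMapConcat.via(...)).map.via(...) is that it's a single stream, and everything between the first and second via(s) will take a while.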

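If you're OK with interleaving objects from different files in the output topic, and with the unavoidable possibility that objects from a given file are produced to the output topic twice (either of which may or may not impose meaningful constraints/difficulties on the implementation of that topic's downstream consumers), you can process the files in parallel.

The mapAsync stream stage is especially useful for this: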

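import akka.Done
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

// assuming there's an implicit Materializer/ActorSystem (depending on the version of Akka Streams you're running)
// and an implicit ExecutionContext (needed by the Future.map further down) in scope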
def process(path: Path): Future[Done] =
  Source.single(path)
    .via(PassThroughFlow(fileFlow))
    .map { case (bytes, path) => (bytes, entityConfigMap(getCountryPrefix(path))) }
    .map(bytesAndPath => (bytesAndPath._1.utf8String.parseJson.asJsObject, bytesAndPath._2))
    .map { case (bytes, entityConfig) => (toGenericRecord(bytes, entityConfig), entityConfig) }
    .map { case (record, entityConfig) =>
      producerMessagesToTopic.mark()
      ProducerMessage.single(
        new ProducerRecord[NotUsed, GenericRecord](getDataTopicName(entityConfig), record),
        passThrough = entityConfig)
    }
    .via {
      akka.kafka.scaladsl.Producer.flexiFlow(prodSettings)
    }
    .runWith(Sink.ignore)

 // then starting right after .flatMapConcat(paths => Source(paths))
 .mapAsync(parallelism) { committableMsg =>
   val p = new Path(committableMsg.record.value().get("fullPath").asInstanceOf[String])
   process(p).map { _ => committableMsg.committableOffset }
 }
 // now have the committable offsets

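parallelism then limits the number of paths being processed at a given time. The ordering of offsets into the committer is preserved (i.e. an offset never reaches the committer until every message up to and including it has been fully processed).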
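To make the idea concrete without Kafka or HDFS, here is a minimal self-contained sketch of the same pattern: each file runs as its own stream, and the completion of that stream's Future serves as the EOF signal. The in-memory files map, the file names and the println stand-in for producing to Kafka are assumptions for illustration only. The sketch assumes Akka 2.6+, where an implicit ActorSystem provides the materializer.

```scala
import akka.Done
import akka.actor.ActorSystem
import akka.stream.scaladsl.{JsonFraming, Sink, Source}
import akka.util.ByteString

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object PerFileStreams {
  // Hypothetical in-memory "files" (name -> JSON array), standing in for HdfsSource.data
  val files: Map[String, String] = Map(
    "a.json" -> """[{"id":1},{"id":2}]""",
    "b.json" -> """[{"id":3}]"""
  )

  // Runs one file as its own stream; the returned Future completes at that file's EOF
  def process(name: String)(implicit system: ActorSystem): Future[Done] =
    Source.single(ByteString(files(name)))
      .via(JsonFraming.objectScanner(1024))
      // stand-in (an assumption) for producing each object to Kafka
      .runWith(Sink.foreach[ByteString](obj => println(s"$name: ${obj.utf8String}")))

  // Emits each name downstream only after its file is done; mapAsync keeps the input order
  def processAll(names: List[String])(implicit system: ActorSystem): Future[Seq[String]] = {
    import system.dispatcher // ExecutionContext for Future.map
    Source(names)
      .mapAsync(parallelism = 2)(name => process(name).map(_ => name))
      .runWith(Sink.seq)
  }

  def main(args: Array[String]): Unit = {
    implicit val system: ActorSystem = ActorSystem("per-file-sketch")
    try println(Await.result(processAll(List("a.json", "b.json")), 10.seconds))
    finally system.terminate()
  }
}
```

In the real pipeline the element emitted after process completes would be the committable offset rather than the file name, and the stream could then end in a committer sink such as Committer.sink from Alpakka Kafka.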

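I'll rephrase it, Levi. What I mean is that I can't hold the entire file contents in the producer's buffer, because these are data lake files and they can easily exceed 1 million. So I only need to store the offset once the file has been fully processed, but I need a way to split the file's contents into chunks and then figure out how to expose and handle the EOF event in order to store the offset. The chunk splitting may work out of the box with the methods KafkaProducer provides, but I couldn't find any EOF handling.

With my approach, you get the EOF event from the fact that the stream completes (which completes the Future), because instead of flattening the stream from HdfsSource into the overall stream (and thereby discarding the stream completion), we run each file as its own stream. Whether HdfsSource eagerly reads everything into memory, I don't know, but that's not the question you asked.

Confirmed from the docs: HdfsSource only reads 8k at a time.

Levi, that's an amazing idea. I'll try rewriting the code tomorrow and report back with further feedback. Thanks!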