Java: Google Dataflow job that reads from Pub/Sub and writes to GCS is very slow (WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards takes too long)

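We currently have a Dataflow job that reads from Pub/Sub and writes Avro files to GCS using FileIO.writeDynamic. When we test at 10,000 events/sec, the job cannot keep up because WriteFiles/WriteShardedBundlesToTempFiles/GroupIntoShards is very slow. Below is the code snippet we use for the write. How can we improve it?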

PCollection<Event> windowedWrites = input.apply("Global Window",
    Window.<Event>into(new GlobalWindows())
        .triggering(Repeatedly.forever(
            AfterFirst.of(
                AfterPane.elementCountAtLeast(50000),
                AfterProcessingTime.pastFirstElementInPane()
                    .plusDelayOf(DurationUtils.parseDuration(windowDuration)))))
        .discardingFiredPanes());

return windowedWrites
    .apply("WriteToAvroGCS", FileIO.<EventDestination, Five9Event>writeDynamic()
        .by(groupFn)
        .via(outputFn, Contextful.fn(new SinkFn()))
        .withTempDirectory(avroTempDirectory)
        .withDestinationCoder(destinationCoder)
        .withNumShards(1)
        .withNaming(namingFn));

We use a custom file naming format, for example gs://tenantID./eventname/dddddd mm dd/

As mentioned in the comments, the problem is most likely withNumShards(1), which forces everything to happen on a single worker.

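As Robert said, when you use withNumShards(1), Dataflow/Beam cannot parallelize the write, so it all happens on the same worker. When bundles are relatively large, this has a big impact on pipeline performance. I put together an example to illustrate this: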

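I ran 3 pipelines that each generate a lot of elements (~2 GB). All three had 10 n1-standard-1 workers, but used 1 shard, 10 shards, and 0 shards respectively (with 0, Dataflow chooses the number of shards). This is how they behaved: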

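We see a big difference in total time between 0 or 10 shards and 1 shard. If we look at the job that uses 1 shard, we see that only one worker is doing anything (I disabled autoscaling):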

As Reza mentioned, this happens because all elements need to be shuffled to the same worker so that it can write the single shard.
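To make the shuffle argument concrete, here is a minimal sketch in plain Python (no Beam required). It assumes a simplified model in which every element is hashed to one of `num_shards` buckets and each non-empty bucket is written by at most one worker at a time. The helper names `assign_shards` and `max_parallel_writers` are hypothetical, and Beam's real shard assignment differs in detail, so treat this only as an illustration of why a single shard caps the write at one worker.

```python
import zlib
from collections import Counter


def assign_shards(elements, num_shards):
    """Count how many elements land in each shard under a simple hash model.

    Assumption: shard = crc32(repr(element)) % num_shards. This is a stand-in
    for the real assignment, chosen only because it is deterministic.
    """
    counts = Counter()
    for element in elements:
        shard = zlib.crc32(repr(element).encode("utf-8")) % num_shards
        counts[shard] += 1
    return counts


def max_parallel_writers(elements, num_shards, num_workers):
    """Upper bound on concurrent writers: one per non-empty shard, capped by workers."""
    non_empty_shards = len(assign_shards(elements, num_shards))
    return min(non_empty_shards, num_workers)


if __name__ == "__main__":
    elements = range(100_000)
    for shards in (1, 10):
        writers = max_parallel_writers(elements, shards, num_workers=10)
        print(f"num_shards={shards}: at most {writers} worker(s) can write")
```

Under this model, adding workers does not help while the shard count stays at 1, because every element still has to reach the single worker that owns that shard.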

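Note that my example is batch, which behaves differently from streaming with respect to threading, but the effect on pipeline performance is very similar (in fact, streaming is probably worse).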

Here is some Python code you can use to test this yourself:

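    import argparse

    import apache_beam as beam
    from apache_beam import Create
    from apache_beam.io import WriteToText
    from apache_beam.options.pipeline_options import PipelineOptions


    def run(argv=None):
        # --output and --shards are the two values the pipeline reads from known_args.
        parser = argparse.ArgumentParser()
        parser.add_argument("--output", required=True)
        parser.add_argument("--shards", type=int, default=0)
        known_args, pipeline_args = parser.parse_known_args(argv)
        pipeline_options = PipelineOptions(pipeline_args)

        p = beam.Pipeline(options=pipeline_options)

        def long_string_generator():
            import random  # imported here so it is also available on remote workers
            string = "Apache Beam is an open source, unified model for defining " \
                     "both batch and streaming data-parallel processing " \
                     "pipelines. Using one of the open source Beam SDKs, " \
                     "you build a program that defines the pipeline. The pipeline " \
                     "is then executed by one of Beam’s supported distributed " \
                     "processing back-ends, which include Apache Flink, Apache " \
                     "Spark, and Google Cloud Dataflow. "

            word_choice = random.sample(string.split(" "), 20)

            return " ".join(word_choice)

        def generate_elements(element, amount=1):
            # Emit `amount` (key, random string) pairs per input element.
            return [(element, long_string_generator()) for _ in range(amount)]

        (p | Create(range(1500))
           | beam.FlatMap(generate_elements, amount=10000)
           | WriteToText(known_args.output, num_shards=known_args.shards))

        p.run()


    if __name__ == "__main__":
        run()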

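Is there a reason you specify 1 shard?

Because the window logic specifies 50K and we use a group-by, we don't want to generate more files arbitrarily.

+1 to Inigo's comment: a shard count of 1 requires all values of the window to be shuffled to a single thread.

So I experimented with the number of shards. I set it to 5 and hit OOM because too many files were being written.

Using more files should not cause OOM; it actually helps to avoid it. If you get OOM, try machines with more memory per vCPU and set the shards to 0 so that Beam determines the best way to spread the shards. Also, the JIRA you sent does not seem to be related; it is marked as Fixed :D

With writeDynamic by(groupFn), since we group by tenantId and event timestamp, shouldn't that help with parallelism?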