Java: How to set a file size instead of a number of shards when writing from BigQuery to Cloud Storage in Dataflow


java google-bigquery google-cloud-storage google-cloud-dataflow


I am currently using Dataflow to read table data from BigQuery and write it to Cloud Storage with a set number of shards.

//Read Main Input
PCollection<TableRow> input = pipeline.apply("ReadTableInput",
    BigQueryIO.readTableRows().from("dataset.table"));

// process and write files
input.apply("ProcessRows", ParDo.of(new Process()))
    .apply("WriteToFile", TextIO.write()
        .to(outputFile)
        .withHeader(HEADER)
        .withSuffix(".csv")
        .withNumShards(numShards));


To manage file size, we estimate the total number of shards needed to keep each file at a certain size.
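A minimal sketch of that kind of estimate, assuming the total output size in bytes is already known or approximated from some other source. The class name `ShardEstimator` and both parameters are hypothetical and not part of Beam; the arithmetic is a plain ceiling division, and real file sizes will still vary because rows are not guaranteed to be spread evenly across shards.

```java
public final class ShardEstimator {
    private ShardEstimator() {}

    /**
     * Estimates how many shards are needed so that the average file is
     * no larger than targetBytesPerFile. Both inputs are assumptions
     * supplied by the caller; Beam does not provide them.
     */
    public static int estimateNumShards(long estimatedTotalBytes, long targetBytesPerFile) {
        if (targetBytesPerFile <= 0) {
            throw new IllegalArgumentException("targetBytesPerFile must be positive");
        }
        if (estimatedTotalBytes <= 0) {
            return 1; // always write at least one file
        }
        // Ceiling division written to avoid overflow for large positive sizes.
        long shards = (estimatedTotalBytes - 1) / targetBytesPerFile + 1;
        return (int) Math.min(shards, Integer.MAX_VALUE);
    }

    public static void main(String[] args) {
        long estimatedTotalBytes = 5L * 1024 * 1024 * 1024; // hypothetical 5 GiB of output
        long targetBytesPerFile = 256L * 1024 * 1024;       // hypothetical 256 MiB per file
        System.out.println(estimateNumShards(estimatedTotalBytes, targetBytesPerFile));
    }
}
```

The resulting value would then be passed to `withNumShards(...)` as in the pipeline above.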

Is there a way to set a file size instead of a number of shards, and let the number of shards be dynamic?

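By design, this is not possible. If you dig into the core of Beam, you programmatically define an execution graph and then run it. The process is massively parallel (ParDo means "Parallel Do"), running on the same node or across multiple nodes/VMs.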

Here, the number of shards is simply the number of "writers" that write files in parallel. The PCollection is then split across all of those writers.

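Size is highly variable (for example, it depends on the size of each message, the text encoding, whether the output is compressed, and the compression factor), so Beam cannot rely on it to build its graph.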
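To illustrate why byte size is hard to predict from the number of elements alone, here is a small standalone sketch using only the Java standard library. It is not Beam code; the class `OutputSizeDemo` and the sample rows are hypothetical. It measures the same text under two encodings and after gzip compression.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public final class OutputSizeDemo {
    private OutputSizeDemo() {}

    /** Number of bytes the text occupies in the given encoding. */
    public static int encodedSize(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    /** Number of bytes the data occupies after gzip compression. */
    public static int gzipSize(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // The stream is closed here, so the gzip trailer has been written.
        return buffer.size();
    }

    public static void main(String[] args) {
        // Hypothetical CSV content: the same row repeated 1000 times.
        StringBuilder rows = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            rows.append("42,example,2020-01-01\n");
        }
        String text = rows.toString();

        System.out.println("UTF-8 bytes:    " + encodedSize(text, StandardCharsets.UTF_8));
        System.out.println("UTF-16LE bytes: " + encodedSize(text, StandardCharsets.UTF_16LE));
        System.out.println("gzip bytes:     "
            + gzipSize(text.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The same 1000 rows produce three different byte counts, which is the kind of variation the answer refers to.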