Google Cloud Dataflow: Why is writing to BigQuery from a Dataflow/Beam pipeline slow?


We have a very simple pipeline that reads data from GCS, applies a simple ParDo, and then writes the results to BigQuery. It autoscales to 50 VMs, runs on GCP, and does nothing fancy.

Reading all the data from GCS (~10B records and ~700+ GB) and transforming it is relatively fast (it finishes within the first 7-10 minutes).

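However, as soon as it starts writing to BigQuery (using BigQueryIO), it slows down dramatically, even though it only has to write about 1M records (~60 MB). That step alone takes about 20 minutes.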

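Besides the slow write to BigQuery, the graph also shows the step as "stopped", even though it succeeded (albeit very slowly). The step also looks overly complex for a simple write to BigQuery (see the image below).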

The bottleneck occurs at the step "Executing operation BigQueryIO.Write/BatchLoads/WriteRename" (see the logs below).

Am I doing something wrong in my code?

Code:

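import static org.apache.beam.runners.dataflow.options.DataflowPipelineWorkerPoolOptions.AutoscalingAlgorithmType.THROUGHPUT_BASED;
import static org.apache.beam.sdk.io.TextIO.CompressionType.GZIP;
import static org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED;
import static org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.ArrayList;
import java.util.List;
import org.apache.beam.runners.dataflow.options.DataflowPipelineOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

// Named MelbourneTitlesPipeline so the class does not clash with org.apache.beam.sdk.Pipeline.
public class MelbourneTitlesPipeline {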
    private static final String BIG_QUERY_TABLE = "<redacted>:<redacted>.melbourne_titles";
    private static final String BUCKET = "gs://<redacted>/*.gz";

    public static void main(String[] args) {
        DataflowPipelineOptions options = PipelineOptionsFactory
                .fromArgs(args)
                .withValidation()
                .as(DataflowPipelineOptions.class);
        options.setAutoscalingAlgorithm(THROUGHPUT_BASED);
        Pipeline pipeline = Pipeline.create(options);
        pipeline.apply(TextIO.read().from(BUCKET).withCompressionType(GZIP))
                .apply(ParDo.of(new DoFn<String, TableRow>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws Exception {
                        String input = c.element();
                        String title = input.split(",")[5];
                        if (title.toLowerCase().contains("melbourne")) {
                            TableRow tableRow = new TableRow();
                            tableRow.set("title", title);
                            c.output(tableRow);
                        }
                    }
                }))
                .apply(BigQueryIO.writeTableRows()
                        .to(BIG_QUERY_TABLE)
                        .withCreateDisposition(CREATE_IF_NEEDED)
                        .withWriteDisposition(WRITE_TRUNCATE)
                        .withSchema(getSchema()));
        pipeline.run();
    }

    private static TableSchema getSchema() {
        List<TableFieldSchema> fields = new ArrayList<>();
        fields.add(new TableFieldSchema().setName("title").setType("STRING"));
        TableSchema schema = new TableSchema().setFields(fields);
        return schema;
    }
}

Logs:

  2017-08-25 (21:30:23) Executing operation BigQueryIO.Write/BatchLoads/View.AsSingleton/BatchViewOverrides.GroupByWindowHas...
  2017-08-25 (21:30:23) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/Distinct Keys/Combine.perKey(Anonym...
  2017-08-25 (21:30:23) Executing operation BigQueryIO.Write/BatchLoads/GroupByKey/Create
  2017-08-25 (21:30:23) Executing operation BigQueryIO.Write/BatchLoads/TempFilePrefixView/BatchViewOverrides.GroupByWindowH...
  2017-08-25 (21:30:23) Starting 50 workers in australia-southeast1-a...
  2017-08-25 (21:30:23) Executing operation TextIO.Read/Read+ParDo(Anonymous)+BigQueryIO.Write/PrepareWrite/ParDo(Anonymous)...
  2017-08-25 (21:30:23) Executing operation BigQueryIO.Write/BatchLoads/TriggerIdCreation/Read(CreateSource)+BigQueryIO.Writ...
  2017-08-25 (21:30:23) Executing operation BigQueryIO.Write/BatchLoads/Create/Read(CreateSource)+BigQueryIO.Write/BatchLoad...
  2017-08-25 (21:31:21) Executing operation BigQueryIO.Write/BatchLoads/View.AsSingleton/BatchViewOverrides.GroupByWindowHas...
  2017-08-25 (21:31:21) Executing operation BigQueryIO.Write/BatchLoads/View.AsSingleton/BatchViewOverrides.GroupByWindowHas...
  2017-08-25 (21:31:22) Executing operation BigQueryIO.Write/BatchLoads/TempFilePrefixView/BatchViewOverrides.GroupByWindowH...
  2017-08-25 (21:31:23) Executing operation BigQueryIO.Write/BatchLoads/TempFilePrefixView/BatchViewOverrides.GroupByWindowH...
  2017-08-25 (21:38:10) Executing operation BigQueryIO.Write/BatchLoads/TempFilePrefixView/CreateDataflowView
  2017-08-25 (21:38:13) Executing operation BigQueryIO.Write/BatchLoads/View.AsSingleton/CreateDataflowView
  2017-08-25 (21:38:45) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/Distinct Keys/Combine.perKey(Anonym...
  2017-08-25 (21:38:45) Executing operation BigQueryIO.Write/BatchLoads/GroupByKey/Close
  2017-08-25 (21:38:45) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForData/BatchViewOverri...
  2017-08-25 (21:38:45) Executing operation BigQueryIO.Write/BatchLoads/GroupByKey/Read+BigQueryIO.Write/BatchLoads/GroupByK...
  2017-08-25 (21:38:45) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/Distinct Keys/Combine.perKey(Anonym...
  2017-08-25 (21:38:49) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForData/BatchViewOverri...
  2017-08-25 (21:38:49) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForSize/Create
  2017-08-25 (21:38:49) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForKeys/Create
  2017-08-25 (21:38:49) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForData/BatchViewOverri...
  2017-08-25 (21:38:56) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForSize/Close
  2017-08-25 (21:38:56) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForKeys/Close
  2017-08-25 (21:38:56) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForSize/Read+BigQueryIO...
  2017-08-25 (21:38:56) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/GBKaSVForKeys/Read+BigQueryIO...
  2017-08-25 (21:39:00) Executing operation s35-u80
  2017-08-25 (21:39:01) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/Flatten.PCollections
  2017-08-25 (21:39:03) Executing operation BigQueryIO.Write/BatchLoads/CalculateSchemas/asMap/CreateDataflowView
  2017-08-25 (21:39:06) Executing operation BigQueryIO.Write/BatchLoads/ResultsView/CreateDataflowView
  2017-08-25 (21:39:12) Executing operation BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Create
  2017-08-25 (21:39:12) Executing operation BigQueryIO.Write/BatchLoads/MultiPartitionsReshuffle/GroupByKey/Create
  2017-08-25 (21:39:12) Executing operation BigQueryIO.Write/BatchLoads/Create.Values/Read(CreateSource)+BigQueryIO.Write/Ba...
  2017-08-25 (21:39:33) Executing operation BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Close
  2017-08-25 (21:39:33) Executing operation BigQueryIO.Write/BatchLoads/MultiPartitionsReshuffle/GroupByKey/Close
  2017-08-25 (21:39:33) Executing operation BigQueryIO.Write/BatchLoads/SinglePartitionsReshuffle/GroupByKey/Read+BigQueryIO...
  2017-08-25 (21:39:33) Executing operation BigQueryIO.Write/BatchLoads/TempTablesView/BatchViewOverrides.GroupByWindowHashA...
  2017-08-25 (21:39:33) Executing operation BigQueryIO.Write/BatchLoads/MultiPartitionsReshuffle/GroupByKey/Read+BigQueryIO....
  2017-08-25 (21:39:35) Executing operation BigQueryIO.Write/BatchLoads/TempTablesView/BatchViewOverrides.GroupByWindowHashA...
  2017-08-25 (21:39:35) Executing operation BigQueryIO.Write/BatchLoads/TempTablesView/BatchViewOverrides.GroupByWindowHashA...
  2017-08-25 (21:39:46) Executing operation BigQueryIO.Write/BatchLoads/TempTablesView/CreateDataflowView
  2017-08-25 (21:39:46) Executing operation BigQueryIO.Write/BatchLoads/WriteRename
  2017-08-25 (21:57:43) Stopping worker pool...
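
To see where the time goes, it can help to measure the gap between consecutive log entries. The following is a minimal sketch, not part of the original pipeline; the class name LogGaps and its helper methods are hypothetical, and it assumes every log line starts with a timestamp in the "2017-08-25 (21:39:46)" format shown above.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class LogGaps {
    // Matches the "2017-08-25 (21:39:46)" prefix used in the job log.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd '('HH:mm:ss')'");

    // Length of a timestamp such as "2017-08-25 (21:39:46)".
    private static final int TIMESTAMP_LENGTH = 21;

    // Parses the timestamp at the start of a log line (leading spaces are ignored).
    static LocalDateTime parseTimestamp(String line) {
        return LocalDateTime.parse(line.trim().substring(0, TIMESTAMP_LENGTH), FORMAT);
    }

    // Returns the time elapsed between two log lines.
    static Duration gapBetween(String earlier, String later) {
        return Duration.between(parseTimestamp(earlier), parseTimestamp(later));
    }

    public static void main(String[] args) {
        String writeRename =
                "  2017-08-25 (21:39:46) Executing operation BigQueryIO.Write/BatchLoads/WriteRename";
        String stopping = "  2017-08-25 (21:57:43) Stopping worker pool...";

        Duration gap = gapBetween(writeRename, stopping);
        // Prints "17 min 57 s"
        System.out.println(gap.toMinutes() + " min " + (gap.getSeconds() % 60) + " s");
    }
}
```

Applied to the last two entries, this shows that almost 18 minutes pass between the start of WriteRename and the worker pool shutting down, which is nearly all of the slow write phase.
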
Overly complex-looking step:


Job details:

  • Dataflow Java SDK: 2.0.0
  • Job ID: 2017-08-25_04_29_54-7210937293145071720

Update

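I think the problem is the sheer number of files that Dataflow generates and that BigQuery then has to load. It may only be 1M rows, but Dataflow is producing 850+ files to load: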

  "configuration" : {
    "load" : {
      "createDisposition" : "CREATE_IF_NEEDED",
      "destinationTable" : {
        "datasetId" : "dataflow_on_a_tram",
        "projectId" : "grey-sort-challenge",
        "tableId" : "melbourne_titles"
      },
      "schema" : {
        "fields" : [ {
          "name" : "year",
          "type" : "STRING"
        }, {
          "name" : "month",
          "type" : "STRING"
        }, {
          "name" : "day",
          "type" : "STRING"
        }, {
          "name" : "wikimedia_project",
          "type" : "STRING"
        }, {
          "name" : "language",
          "type" : "STRING"
        }, {
          "name" : "title",
          "type" : "STRING"
        }, {
          "name" : "views",
          "type" : "INTEGER"
        } ]
      },
      "sourceFormat" : "NEWLINE_DELIMITED_JSON",
      "sourceUris" : [
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/51221a43-8fd8-417d-90ca-2f3c3e5789d2",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/5e1c3cb8-20d1-45ef-b0bb-209645c36093",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/0ed8d240-2bc2-4c8b-808d-792540448c73",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/d7a1fefe-6dd8-4f30-bf97-040c3692e448",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/b7c4d9a8-d45d-4cc6-b375-291e6435ed53",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/17a7bbf4-5695-4188-b03a-3ef5cda8607c",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/783af461-c114-4a41-aa5f-ed1c7db86bab",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/dad046fc-eabf-4212-83f1-7d7fa71075c1",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/7b9ffec1-7424-4248-83b4-98a4ef4233b9",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/bb297232-8e84-4a14-9dc6-3efde1b2b586",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/0693972a-1319-4637-af9f-8a4a3d5cb0f7",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/41b1e722-f76c-404d-a71b-bd36c09e8a06",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/19cfd89e-c9ee-4221-aee1-b3503dbcd93b",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/574467f2-5771-479a-b213-2941225a24bd",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/4d872304-0f42-47f2-89cf-b3a3f856ca67",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/1c086246-8eec-4bbe-be98-b01abb181d33",
 "gs://<redacted>/tmp/BigQueryWriteTemp/615faf65cef743718624cbeb8fd96f14/9439f5f4-5020-471d-b631-e1a3fea1584f",

[…] 851 files!
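
To put that file count in perspective, here is a rough back-of-the-envelope sketch of how small each temporary file is. It uses only the approximate figures from this question (~60 MB, ~1M rows, 851 files) and assumes 60 MB means 60,000,000 bytes; the class and method names are hypothetical.

```java
public class TempFileSize {
    // Integer average of a total spread evenly over a number of files.
    static long averagePerFile(long total, int fileCount) {
        if (fileCount <= 0) {
            throw new IllegalArgumentException("fileCount must be positive");
        }
        return total / fileCount;
    }

    public static void main(String[] args) {
        long totalBytes = 60_000_000L; // ~60 MB written to BigQuery (approximate)
        long totalRows = 1_000_000L;   // ~1M records (approximate)
        int fileCount = 851;           // temp files listed in the load job

        long bytesPerFile = averagePerFile(totalBytes, fileCount);
        long rowsPerFile = averagePerFile(totalRows, fileCount);

        // Prints "~70 KB and ~1175 rows per file"
        System.out.println("~" + (bytesPerFile / 1000) + " KB and ~" + rowsPerFile + " rows per file");
    }
}
```

On these numbers each file holds only about 70 KB, so the load job is dealing with many tiny files rather than a few large ones.
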

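Keep in mind that BigQuery does not guarantee any latency for load jobs. If many other load jobs are submitted at the same time, your job may wait in the queue before it is scheduled. If you can run this job again, we should be able to help you inspect the BigQuery load job itself to see what happened.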

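Looking at the worker logs, it seems to be just waiting for the BQ load job to finish. It may be that the BQ job was slower than usual. Does this happen consistently?

I ran it again. Same problem. 2017-08-25_14_26_18-5377718284053913263

@Pablo - see my update. I think it is so slow because Dataflow produces too many files for BQ to load.

Have you tried setting the maximum number of files per bundle in BigQuerySink?

I haven't. I'll look into it.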