Google Cloud Dataflow - Batch insert in BigQuery


google-cloud-platform google-cloud-dataflow


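I was able to create a Dataflow pipeline that reads data from Pub/Sub and, after processing, writes it to BigQuery in streaming mode.

Now I want to run the pipeline in batch mode instead of streaming mode to reduce costs.

Currently my pipeline does streaming inserts into BigQuery with dynamic destinations. I would like to know whether there is a way to perform batch inserts with dynamic destinations.

Below is the pipeline: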

public class StarterPipeline {

    public interface StarterPipelineOption extends PipelineOptions {

        /**
         * Set this required option to specify where to read the input.
         */
        @Description("Path of the file to read from")
        @Default.String(Constants.pubsub_event_pipeline_url)
        String getInputFile();

        void setInputFile(String value);
    }

    @SuppressWarnings("serial")
    public static void main(String[] args) throws SocketTimeoutException {

        StarterPipelineOption options = PipelineOptionsFactory.fromArgs(args).withValidation()
                .as(StarterPipelineOption.class);

        Pipeline p = Pipeline.create(options);

        PCollection<String> datastream = p.apply("Read Events From Pubsub",
                PubsubIO.readStrings().fromSubscription(Constants.pubsub_event_pipeline_url));

        // Fire a pane every 300 seconds of processing time on the global window.
        PCollection<String> windowed_items = datastream.apply(Window.<String>into(new GlobalWindows())
                .triggering(Repeatedly.forever(
                        AfterProcessingTime.pastFirstElementInPane().plusDelayOf(Duration.standardSeconds(300))))
                .withAllowedLateness(Duration.standardDays(10)).discardingFiredPanes());

        // Write into BigQuery
        windowed_items.apply("Read and make event table row", new ReadEventJson_bigquery())
                .apply("Write_events_to_BQ",
                        BigQueryIO.writeTableRows().to(new DynamicDestinations<TableRow, String>() {
                            @Override
                            public String getDestination(ValueInSingleWindow<TableRow> element) {
                                String destination = EventSchemaBuilder
                                        .fetch_destination_based_on_event(element.getValue().get("event").toString());
                                return destination;
                            }

                            @Override
                            public TableDestination getTable(String table) {
                                String destination = EventSchemaBuilder.fetch_table_name_based_on_event(table);
                                return new TableDestination(destination, destination);
                            }

                            @Override
                            public TableSchema getSchema(String table) {
                                TableSchema table_schema = EventSchemaBuilder.fetch_table_schema_based_on_event(table);
                                return table_schema;
                            }
                        }).withCreateDisposition(CreateDisposition.CREATE_NEVER)
                                .withWriteDisposition(WriteDisposition.WRITE_APPEND)
                                .withFailedInsertRetryPolicy(InsertRetryPolicy.retryTransientErrors()));

        p.run().waitUntilFinish();

        log.info("Events Pipeline Job Stopped");
    }

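}

Batch or streaming is determined by the PCollection, so you need to convert the streaming PCollection from Pub/Sub into a batched PCollection before writing to BigQuery. The transform that allows this is GroupIntoBatches.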

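Note that because this transform works on key-value pairs, each batch contains only elements of a single key. Non-KV elements must be assigned a key first.

Once you have created the batched PCollection with this transform, apply the BigQuery write with dynamic destinations just as you did for the streaming PCollection.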
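To make the single-key point concrete, here is a plain-Java sketch (no Beam dependency) of what grouping keyed elements into fixed-size batches produces. It only illustrates the semantics and is not Beam's implementation; the class name, `batchByKey`, and the sample events are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration only: mimics per-key batching by count, without using Beam.
class KeyedBatchingSketch {

    /** Groups values per key into batches holding at most batchSize elements. */
    static Map<String, List<List<String>>> batchByKey(
            List<Map.Entry<String, String>> keyedElements, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        Map<String, List<List<String>>> batches = new LinkedHashMap<>();
        for (Map.Entry<String, String> kv : keyedElements) {
            List<List<String>> keyBatches =
                    batches.computeIfAbsent(kv.getKey(), k -> new ArrayList<>());
            // Start a new batch when the key has none yet or its latest batch is full.
            if (keyBatches.isEmpty()
                    || keyBatches.get(keyBatches.size() - 1).size() == batchSize) {
                keyBatches.add(new ArrayList<>());
            }
            keyBatches.get(keyBatches.size() - 1).add(kv.getValue());
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, String>> events = List.of(
                Map.entry("click", "c1"),
                Map.entry("view", "v1"),
                Map.entry("click", "c2"),
                Map.entry("click", "c3"));
        // Prints {click=[[c1, c2], [c3]], view=[[v1]]}
        System.out.println(batchByKey(events, 2));
    }
}
```

No batch mixes "click" and "view" elements, which is why the elements need a key before they can be batched this way.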

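You can limit costs by using file loads instead of streaming inserts. The documentation states that BigQueryIO.Write supports two methods of inserting data into BigQuery, specified using BigQueryIO.Write.withMethod(org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.Method). If no method is supplied, a default method is chosen based on the input PCollection. See BigQueryIO.Write.Method for more information about the methods.

The different insertion methods offer different trade-offs in cost, quota, and data consistency. See the BigQuery documentation for more information about these trade-offs.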
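As a rough sketch of selecting the insertion method explicitly, the configuration fragment below builds a write that uses load jobs rather than streaming inserts. It assumes the Beam Java SDK with the GCP IO module on the classpath. The class and method names, the five-minute triggering frequency, and the shard count are placeholders, not values from the question, and `destinations` stands for the `DynamicDestinations` instance defined there. The question's `withFailedInsertRetryPolicy` call is left out because that policy concerns streaming inserts.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.CreateDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO.Write.WriteDisposition;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.joda.time.Duration;

// Hypothetical helper class; only the BigQueryIO calls come from the Beam SDK.
class FileLoadsWriteSketch {

    /** Builds a BigQuery write that issues periodic load jobs instead of streaming inserts. */
    static BigQueryIO.Write<TableRow> fileLoadsWrite(
            DynamicDestinations<TableRow, String> destinations) {
        return BigQueryIO.writeTableRows()
                .to(destinations)
                // Choose load jobs explicitly instead of relying on the default method.
                .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
                // For an unbounded input such as Pub/Sub, sets how often load jobs are started.
                // Placeholder value.
                .withTriggeringFrequency(Duration.standardMinutes(5))
                // Files written per trigger. Placeholder value; whether this must be set
                // may depend on the Beam version.
                .withNumFileShards(1)
                .withCreateDisposition(CreateDisposition.CREATE_NEVER)
                .withWriteDisposition(WriteDisposition.WRITE_APPEND);
    }
}
```

The returned transform would be applied where the question applies "Write_events_to_BQ". Rows then become visible in BigQuery only after each load job completes, which is part of the consistency trade-off mentioned above.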

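Using GroupIntoBatches in Java will not reduce cost, because streaming is still taking place. It is meant for tuning the batch size. In a batch pipeline, all records go through together. In a streaming pipeline, records go through the steps as they enter the pipeline or as windows fire. What GroupIntoBatches does is take individual records and group them by count, but the pipeline itself still runs in streaming mode.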