Google Cloud Platform: Dataflow job with mixed workloads - streaming inserts & load jobs (Python)


I am trying to implement a use case following the design explained in the link below, but I am running into errors. Any pointers would be much appreciated.

Step-by-step explanation of the use case:

  • Get streaming raw events from Pub/Sub
  • Validate the received raw events
  • Filter events of specific types
  • Create a dict for each filtered event
  • In parallel, pass the filtered events through a windowing operation and aggregate them
  • Two types of output: raw event dicts and aggregated event dicts
  • Per the design explained in the link above, the raw event dicts are low urgency and the aggregated events are high urgency
  • Trying the 'FILE_LOADS' method for the raw events to keep costs down
  • Trying the 'STREAMING_INSERTS' method for the aggregated events since they need to be available in real time
  • Code snippet below:

    import apache_beam as beam
    from apache_beam.transforms import window
    # The Python SDK exposes both dispositions on BigQueryDisposition;
    # the aliases below match the names used in the pipeline code.
    from apache_beam.io.gcp.bigquery import BigQueryDisposition as CreateDisposition
    from apache_beam.io.gcp.bigquery import BigQueryDisposition as WriteDisposition
    from apache_beam.io.gcp.bigquery_tools import RetryStrategy

    p = beam.Pipeline(argv=argv)
    valid_msgs, errors = (p
                          | 'Read from Pubsub' >>
                          beam.io.ReadFromPubSub(subscription=c['SUBSCRIPTION']).with_output_types(bytes)
                          | 'Validate PubSub Event' >> beam.ParDo(ValidateMessages()).with_outputs('errors', main='valid')
                          )

    filtered_events = (valid_msgs | 'Filter Events' >> beam.Filter(filter_msgs))

    raw_events = (filtered_events | 'Prepare Raw Event Row for BQ' >> beam.Map(get_raw_values))

    agg_events = (filtered_events
                  | f'Streaming Window for {c["WINDOW_TIME"]} seconds' >> beam.WindowInto(window.FixedWindows(c['WINDOW_TIME']))
                  | 'Event Parser' >> beam.Map(get_agg_values)
                  | 'Event Aggregation' >> beam.CombinePerKey(sum)
                  | 'Prepare Aggregate Event Row for BQ' >> beam.Map(get_count)
                  )

    # Raw events are written to BigQuery with periodic load jobs
    # (note: triggering_frequency is in seconds).
    write_result_raw = (raw_events | 'Write Raw Events to BQ' >> beam.io.WriteToBigQuery(c["RAW_TABLE"],
                                                                                         project=c["PROJECT"],
                                                                                         dataset=c["DATASET_NAME"],
                                                                                         method='FILE_LOADS',
                                                                                         triggering_frequency=10))

    # Aggregated events are written to BigQuery using streaming inserts.
    write_result_agg = (agg_events | 'Write Aggregate Results to BQ' >> beam.io.WriteToBigQuery(c["COUNT_TABLE"],
                                                                                                project=c["PROJECT"],
                                                                                                dataset=c["DATASET_NAME"],
                                                                                                create_disposition=CreateDisposition.CREATE_NEVER,
                                                                                                write_disposition=WriteDisposition.WRITE_APPEND,
                                                                                                insert_retry_strategy=RetryStrategy.RETRY_ALWAYS))
    
Error:

    File "/usr/local/lib/python3.6/site-packages/apache_beam/io/gcp/bigquery.py", line 1493, in expand
    42    'triggering_frequency can only be used with '
    43ValueError: triggering_frequency can only be used with FILE_LOADS method of writing to BigQuery.
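
For context, the BigQuery sink accepts `triggering_frequency` only together with the file-loads write method. A minimal sketch of the distinction (table specs are placeholders, not from the original post):

```
import apache_beam as beam

# triggering_frequency pairs only with FILE_LOADS, which flushes
# buffered rows into a BigQuery load job on that cadence (seconds).
load_write = beam.io.WriteToBigQuery(
    'my-project:my_dataset.raw_events',  # placeholder table spec
    method=beam.io.WriteToBigQuery.Method.FILE_LOADS,
    triggering_frequency=600,
)

# Streaming inserts take no triggering_frequency; combining them is
# what raises the ValueError above when the transform expands.
stream_write = beam.io.WriteToBigQuery(
    'my-project:my_dataset.agg_events',  # placeholder table spec
    method=beam.io.WriteToBigQuery.Method.STREAMING_INSERTS,
)
```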
    
Following @Iñigo's answer, I added the flag, but it doesn't work either. See the details below.

    if c['FILE_LOAD']:
        argv.append('--experiments=use_beam_bq_sink')

    p = beam.Pipeline(argv=argv)
    records | 'Write Result to BQ' >> beam.io.WriteToBigQuery(c["RAW_TABLE"],
                                                              project=c["PROJECT"],
                                                              dataset=c["DATASET_NAME"],
                                                              method='FILE_LOADS',
                                                              triggering_frequency=c['FILE_LOAD_FREQUENCY'],
                                                              create_disposition=CreateDisposition.CREATE_NEVER,
                                                              write_disposition=WriteDisposition.WRITE_APPEND,
                                                              insert_retry_strategy=RetryStrategy.RETRY_ON_TRANSIENT_ERROR
                                                              )
    
Error from the Dataflow job:
```Workflow failed. Causes: Because of the shape of your pipeline, the Cloud Dataflow job optimizer produced a job graph that is not updatable using the --update pipeline option. This is a known issue that we are working to resolve. See https://issuetracker.google.com/issues/118375066 for information about how to modify the shape of your pipeline to avoid this error. You can override this error and force the submission of the job by specifying the --experiments=allow_non_updatable_job parameter., The stateful transform named 'Write Errors to BQ/BigQueryBatchFileLoads/ImpulseSingleElementPC/Map(decode).out/FromValue/ReadStream' is in two or more computations.```
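
As a sketch of what the error message itself suggests (the experiment name comes from the error text; the other arguments are placeholders for whatever the script already passes):

```
import apache_beam as beam

# Hypothetical base arguments assembled elsewhere in the script.
argv = ['--runner=DataflowRunner', '--streaming']
# The error message offers this experiment to force submission of a
# job graph that Dataflow would otherwise reject as non-updatable.
argv.append('--experiments=allow_non_updatable_job')
p = beam.Pipeline(argv=argv)
```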
    
    
EDITS: 08/11/2020
    
Added both of the flags mentioned above as pipeline arguments.
    
```INFO:root:Argument to Beam Pipeline:['--project=xxxxxx', '--runner=DataflowRunner', '--job_name=df-pubsub-raw', '--save_main_session', '--staging_location=gs:/staging/', '--temp_location=gs://temp/', '--network=dataflow-localnet', '--subnetwork=regions/us-central1/subnetworks/us-central1', '--region=us-central1', '--service_account_email=xxx@YYY.iam.gserviceaccount.com', '--no_use_public_ips', '--streaming', '--experiments=[allow_non_updatable_job, use_beam_bq_sink]']
    
    INFO:root:File load enabled 
    INFO:root:Write using file load with frequency:5  
    
    26  File "./dataflow_ps_stream_bq.py", line 133, in stream_to_bq  27    write_disposition=WriteDisposition.WRITE_APPEND  28  File "/usr/local/lib/python3.6/site-packages/apache_beam/pvalue.py", line 141, in __or__  29    return self.pipeline.apply(ptransform, self)  30  File "/usr/local/lib/python3.6/site-packages/apache_beam/pipeline.py", line 610, in apply  31    transform.transform, pvalueish, label or transform.label)  32  File "/usr/local/lib/python3.6/site-packages/apache_beam/pipeline.py", line 620, in apply  33    return self.apply(transform, pvalueish)  34  File "/usr/local/lib/python3.6/site-packages/apache_beam/pipeline.py", line 663, in apply  35    pvalueish_result = self.runner.apply(transform, pvalueish, self._options)  36  File "/usr/local/lib/python3.6/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 153, in apply  37    return super(DataflowRunner, self).apply(transform, input, options)  38  File "/usr/local/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 198, in apply  39    return m(transform, input, options)  40  File "/usr/local/lib/python3.6/site-packages/apache_beam/runners/runner.py", line 228, in apply_PTransform  41    return transform.expand(input)  42  File "/usr/local/lib/python3.6/site-packages/apache_beam/io/gcp/bigquery.py", line 1493, in expand  43    'triggering_frequency can only be used with '  44ValueError: triggering_frequency can only be used with FILE_LOADS method of writing to BigQuery.  ```
    
    

You need to add the flag `--experiments=use_beam_bq_sink`. This has been an issue for a while; Dataflow overrides the load type.

You can see this in the code.

It looks like there is also a PR to improve the inserts that includes this (I only skimmed the code).
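
For reference, a minimal sketch of wiring that experiment in through `PipelineOptions` (everything except the experiments entry is a placeholder):

```
from apache_beam.options.pipeline_options import PipelineOptions

# Only the experiments entry is the fix suggested above; each
# experiment goes in as its own list element.
options = PipelineOptions(
    streaming=True,
    experiments=['use_beam_bq_sink'],
)
```

Equivalently, `--experiments=use_beam_bq_sink` can be appended to the argv list passed to `beam.Pipeline`, as the asker does in the edit above.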


P.S.: Glad our blog post helped spark some ideas :D

No luck after adding the flag either; I ran into another error. Please see the edits above.

As the error says, add the flag `--experiments=allow_non_updatable_job`. This is caused when the Python pipeline branches in some way. Since you are using two experiments, I think you will need `--experiments=[allow_non_updatable_job, use_beam_bq_sink]`.

Adding both flags together didn't work either. See "EDITS: 08/11/2020" above. I appreciate your input here. It seems the `use_beam_bq_sink` flag is not being read.

Try them separately, i.e., `--experiments=allow_non_updatable_job --experiments=use_beam_bq_sink`. I tried it on my side and on the workers. Note that you can also add the parameters when invoking the Python script: `python mypipeline.py --experiments=allow_non_updatable_job --experiments=use_beam_bq_sink --streaming`

Let me try! Thanks for the quick reply!