Python: Calling beam.io.WriteToBigQuery inside a beam.DoFn

I have created a Dataflow template with some parameters. When I write the data to BigQuery, I would like to make use of these parameters to determine which table it should write to. I have tried calling WriteToBigQuery inside a ParDo, as suggested in the following link.

The pipeline ran successfully, but no data was created or loaded into BigQuery. Any idea what might be the issue?

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, DebugOptions


def run():
  pipeline_options = PipelineOptions()
  pipeline_options.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

  with beam.Pipeline(options=pipeline_options) as p:
    custom_options = pipeline_options.view_as(CustomOptions)

    _ = (
      p
      | beam.Create([None])
      | 'Year to periods' >> beam.ParDo(SplitYearToPeriod(custom_options.year))
      | 'Read plan data' >> beam.ParDo(GetPlanDataByPeriod(custom_options.secret_name))
      | 'Transform record' >> beam.Map(transform_record)
      | 'Write to BQ' >> beam.ParDo(WritePlanDataToBigQuery(custom_options.year))
    )

if __name__ == '__main__':
  run()

You have instantiated the PTransform beam.io.gcp.bigquery.WriteToBigQuery inside the process method of your DoFn. There are a couple of problems here:

  • The process method is called for each element of the input PCollection. It is not used for building the pipeline graph. This approach of dynamically constructing the graph will not work.
  • After moving it out of the DoFn, you need to apply the PTransform beam.io.gcp.bigquery.WriteToBigQuery to a PCollection for it to have any effect (see the Beam documentation).

To create a derived value provider for your table name, you would need a "nested" value provider. Unfortunately, this is not supported. You can use the value provider option directly, though.
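
For example, here is a minimal sketch of that last point (the --table value-provider option and the toy schema are hypothetical, not taken from the question): the write is applied to the PCollection when the pipeline graph is built, and the value-provider option is handed straight to the table argument:

import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition
from apache_beam.options.pipeline_options import PipelineOptions, DebugOptions

class TableOptions(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    # Hypothetical option: the template caller supplies the full table spec,
    # e.g. 'project:dataset.table'
    parser.add_value_provider_argument('--table', type=str)

def run():
  pipeline_options = PipelineOptions()
  pipeline_options.view_as(DebugOptions).experiments = ['use_beam_bq_sink']
  table_options = pipeline_options.view_as(TableOptions)

  with beam.Pipeline(options=pipeline_options) as p:
    _ = (
      p
      | beam.Create([{'name': 'example'}])  # stand-in for the real upstream transforms
      | 'Write to BQ' >> beam.io.WriteToBigQuery(
          table=table_options.table,  # ValueProvider used directly; no derived string
          schema='name:STRING',
          create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
          write_disposition=BigQueryDisposition.WRITE_APPEND)
    )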


As a more advanced option, you may be interested in trying out "flex templates", which package your whole program as a Docker image and execute it with parameters.

If the goal is for the code to accept parameters instead of a hard-coded string for the table path, here is one way to achieve it:

  • Add the table parameters as CustomOptions
  • In the run function, add the CustomOptions parameters as default string arguments (via argparse), as shown in the complete example below
  • Pass the table path at pipeline-construction time in the shell file

I noticed that if I use the use_beam_bq_sink flag, I can pass a value provider directly to the table parameter of beam.io.WriteToBigQuery. So I simply let the caller of the template decide which table it needs to write to. I tried flex templates but could not get them to work; I kept getting transient errors. Edited the answer: you can use the value provider directly. You cannot generate a new string from a value provider.
class CustomOptions(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    parser.add_value_provider_argument('--year', type=int)
    parser.add_value_provider_argument('--secret_name', type=str)

class WritePlanDataToBigQuery(beam.DoFn):
  def __init__(self, year_vp):
    self._year_vp = year_vp

  def process(self, element):
    year = self._year_vp.get()

    table = f's4c.plan_data_{year}'
    schema = {
      'fields': [ ...some fields properties ]
    }

    # Problem: the transform is only constructed here inside process();
    # it is never applied to a PCollection, so nothing is ever written.
    beam.io.WriteToBigQuery(
      table=table,
      schema=schema,
      create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
      write_disposition=BigQueryDisposition.WRITE_TRUNCATE,
      method=beam.io.WriteToBigQuery.Method.FILE_LOADS
    )
...

import argparse
import logging

import apache_beam as beam
from apache_beam.options.pipeline_options import (
    DebugOptions, PipelineOptions, SetupOptions)

# Note: Split (a parsing DoFn) and Schema (the BigQuery table schema) are assumed
# to be defined elsewhere; they are referenced below in run().

class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--gcs_input_file_path',
            type=str,
            help='GCS Input File Path'
        )
        parser.add_value_provider_argument(
            '--project_id',
            type=str,
            help='GCP ProjectID'
        )
        parser.add_value_provider_argument(
            '--dataset',
            type=str,
            help='BigQuery DataSet Name'
        )
        parser.add_value_provider_argument(
            '--table',
            type=str,
            help='BigQuery Table Name'
        )

def run(argv=None):

    pipeline_option = PipelineOptions()
    pipeline = beam.Pipeline(options=pipeline_option)
    custom_options = pipeline_option.view_as(CustomOptions)
    pipeline_option.view_as(SetupOptions).save_main_session = True
    pipeline_option.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--gcp_project_id',
        type=str,
        help='GCP ProjectID',
        default=str(custom_options.project_id)
    )
    parser.add_argument(
        '--dataset',
        type=str,
        help='BigQuery DataSet Name',
        default=str(custom_options.dataset)
    )
    parser.add_argument(
        '--table',
        type=str,
        help='BigQuery Table Name',
        default=str(custom_options.table)
    )

    static_options, _ = parser.parse_known_args(argv)
    path = static_options.gcp_project_id + ":" + static_options.dataset + "." + static_options.table

    data = (
            pipeline
            | "Read from GCS Bucket" >>
            beam.io.textio.ReadFromText(custom_options.gcs_input_file_path)
            | "Parse Text File" >>
            beam.ParDo(Split())
            | 'WriteToBigQuery' >>
            beam.io.WriteToBigQuery(
                path,
                schema=Schema,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
            )
    )

    result = pipeline.run()
    result.wait_until_finish()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()

Creating the template from the shell file:

python template.py \
  --dataset dataset_name \
  --table table_name \
  --project project_name \
  --runner DataFlowRunner \
  --region region_name \
  --staging_location gs://bucket_name/staging \
  --temp_location gs://bucket_name/temp \
  --template_location gs://bucket_name/templates/template_name
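
Note that with this approach the table path is resolved while the pipeline graph is built (the argparse defaults above are read at template-construction time), so it is effectively baked into the template. Only options that remain value providers, such as gcs_input_file_path here, can still be supplied when the template is later launched (for example via the Dataflow console or gcloud dataflow jobs run with --parameters).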