Python: calling beam.io.WriteToBigQuery inside a beam.DoFn
I have created a Dataflow template with a few parameters. When writing data to BigQuery, I want to use these parameters to determine which table it should be written to. I have tried calling WriteToBigQuery inside a ParDo, as suggested in the linked post. The pipeline runs successfully, but no data is created or loaded into BigQuery. Any idea what might be wrong?
def run():
    pipeline_options = PipelineOptions()
    pipeline_options.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    with beam.Pipeline(options=pipeline_options) as p:
        custom_options = pipeline_options.view_as(CustomOptions)

        _ = (
            p
            | beam.Create([None])
            | 'Year to periods' >> beam.ParDo(SplitYearToPeriod(custom_options.year))
            | 'Read plan data' >> beam.ParDo(GetPlanDataByPeriod(custom_options.secret_name))
            | 'Transform record' >> beam.Map(transform_record)
            | 'Write to BQ' >> beam.ParDo(WritePlanDataToBigQuery(custom_options.year))
        )


if __name__ == '__main__':
    run()
You have instantiated the PTransform beam.io.gcp.bigquery.WriteToBigQuery inside the process method of your DoFn. There are a couple of issues here:

- The process method is called for every element of the input PCollection. It is not used for building the pipeline graph, so this approach of constructing the graph dynamically will not work.
- After you move it out of the DoFn, you need to apply the PTransform beam.io.gcp.bigquery.WriteToBigQuery to a PCollection for it to have any effect (see the Beam documentation for examples).

To create a derived value provider for your table name, you would need a "nested" value provider. Unfortunately, that is not supported. However, you can use a value provider option directly.
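To see why a derived table name fails, here is a minimal plain-Python sketch (a hypothetical, simplified stand-in for Beam's RuntimeValueProvider, not the real apache_beam class): the derived string would have to be computed at graph-construction time, before the runtime value exists.

```python
class ToyRuntimeValueProvider:
    """Toy stand-in for Beam's RuntimeValueProvider (simplified): the
    wrapped value only becomes available once the pipeline runs."""

    def __init__(self):
        self._value = None            # unset at template-construction time

    def set(self, value):             # the runner supplies this at run time
        self._value = value

    def get(self):
        if self._value is None:
            raise RuntimeError('value not available at construction time')
        return self._value


year_vp = ToyRuntimeValueProvider()

# Deriving the table name while building the graph fails: no value yet.
try:
    table = f's4c.plan_data_{year_vp.get()}'
except RuntimeError as err:
    print(err)                        # value not available at construction time

# Passing the provider itself works, because .get() is deferred to run time.
year_vp.set(2020)
print(f's4c.plan_data_{year_vp.get()}')  # s4c.plan_data_2020
```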
As a more advanced option, you may be interested in trying out "flex templates", which package your whole program as a Docker image and execute it with parameters.

If the goal is to have the code accept parameters instead of a hard-coded string for the table path, here is one way to achieve that:

- Add the table parameters as CustomOptions.
- Inside the run function, add the CustomOptions parameters as default strings, as shown below.
- Pass the table path at pipeline-construction time in the shell file.
I noticed that if I use the use_beam_bq_sink experiment flag, I can pass a value provider directly to the table parameter of beam.io.WriteToBigQuery. So I simply let the caller of the template decide which table it needs to write to. I tried flex templates, but I could not get them to work; I kept getting transient errors.

Edited the answer: you can use a value provider directly. You cannot build a new string out of a value provider.
class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument('--year', type=int)
        parser.add_value_provider_argument('--secret_name', type=str)


class WritePlanDataToBigQuery(beam.DoFn):
    def __init__(self, year_vp):
        self._year_vp = year_vp

    def process(self, element):
        year = self._year_vp.get()

        table = f's4c.plan_data_{year}'
        schema = {
            'fields': [ ...some fields properties ]
        }

        # This only constructs the PTransform; it is never applied to a
        # PCollection, so nothing is ever written.
        beam.io.WriteToBigQuery(
            table=table,
            schema=schema,
            create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=BigQueryDisposition.WRITE_TRUNCATE,
            method=beam.io.WriteToBigQuery.Method.FILE_LOADS
        )
...
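With the sink applied once at pipeline-construction time, per-year table names can still be produced at run time, because WriteToBigQuery also accepts a callable for its table argument. A hedged plain-Python sketch of such a callable (plan_data_table and FakeYearProvider are hypothetical helper names; the real sink would invoke the callable for each element):

```python
def plan_data_table(year_vp):
    # Returns the per-element callable the sink would invoke at run time;
    # year_vp is assumed to expose .get(), like a Beam ValueProvider.
    def table_fn(element):
        return f's4c.plan_data_{year_vp.get()}'
    return table_fn


class FakeYearProvider:
    # Stand-in for a value provider whose value is known only at run time.
    def get(self):
        return 2020


fn = plan_data_table(FakeYearProvider())
print(fn({'any': 'row'}))  # s4c.plan_data_2020
```

In the pipeline itself, this would replace the final ParDo step with something like `'Write to BQ' >> beam.io.WriteToBigQuery(table=plan_data_table(custom_options.year), schema=..., ...)`.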
class CustomOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--gcs_input_file_path',
            type=str,
            help='GCS Input File Path'
        )
        parser.add_value_provider_argument(
            '--project_id',
            type=str,
            help='GCP ProjectID'
        )
        parser.add_value_provider_argument(
            '--dataset',
            type=str,
            help='BigQuery DataSet Name'
        )
        parser.add_value_provider_argument(
            '--table',
            type=str,
            help='BigQuery Table Name'
        )
def run(argv=None):
    pipeline_option = PipelineOptions()
    pipeline = beam.Pipeline(options=pipeline_option)
    custom_options = pipeline_option.view_as(CustomOptions)
    pipeline_option.view_as(SetupOptions).save_main_session = True
    pipeline_option.view_as(DebugOptions).experiments = ['use_beam_bq_sink']

    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--gcp_project_id',
        type=str,
        help='GCP ProjectID',
        default=str(custom_options.project_id)
    )
    parser.add_argument(
        '--dataset',
        type=str,
        help='BigQuery DataSet Name',
        default=str(custom_options.dataset)
    )
    parser.add_argument(
        '--table',
        type=str,
        help='BigQuery Table Name',
        default=str(custom_options.table)
    )
    static_options, _ = parser.parse_known_args(argv)
    path = static_options.gcp_project_id + ":" + static_options.dataset + "." + static_options.table

    data = (
        pipeline
        | "Read from GCS Bucket" >>
        beam.io.textio.ReadFromText(custom_options.gcs_input_file_path)
        | "Parse Text File" >>
        beam.ParDo(Split())
        | 'WriteToBigQuery' >>
        beam.io.WriteToBigQuery(
            path,
            schema=Schema,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
        )
    )

    result = pipeline.run()
    result.wait_until_finish()


if __name__ == '__main__':
    logging.getLogger().setLevel(logging.INFO)
    run()
python template.py \
  --dataset dataset_name \
  --table table_name \
  --project project_name \
  --runner DataflowRunner \
  --region region_name \
  --staging_location gs://bucket_name/staging \
  --temp_location gs://bucket_name/temp \
  --template_location gs://bucket_name/templates/template_name
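Once the template is staged at template_location, a job can be launched from it with the gcloud CLI, supplying the runtime value-provider parameters at launch. A hedged sketch (my_job_name and the GCS paths are placeholders):

```shell
# Launch a job from the staged classic template; runtime value providers
# such as gcs_input_file_path are passed via --parameters.
gcloud dataflow jobs run my_job_name \
  --gcs-location gs://bucket_name/templates/template_name \
  --region region_name \
  --parameters gcs_input_file_path=gs://bucket_name/input.txt
```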