Google Cloud Platform: How do I use Airflow's DataFlowPythonOperator with a Beam pipeline?
Before using DataFlowPythonOperator, I was using Airflow's BashOperator, and it worked fine. My Beam pipeline requires some arguments; below is the command I used in the BashOperator. FYI: this Beam pipeline converts CSV files to Parquet.
python /home/airflow/gcs/pyFile.py --runner DataflowRunner --project my-project --jobname my-job --num-workers 3 --temp_location gs://path/Temp/ --staging_location gs://path/Staging/ --input gs://path/*.txt --odir gs://path/output --ofile current
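These are the arguments I have to pass for my Beam pipeline to work.
Now, how do I pass these arguments to DataFlowPythonOperator?
I tried, but I don't know where each of the arguments is supposed to go.
I tried something like this: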
task1 = DataFlowPythonOperator(
    task_id='my_task',
    py_file='/home/airflow/gcs/pyfile.py',
    gcp_conn_id='google_cloud_default',
    options={
        "num-workers": 3,
        "input": 'gs://path/*.txt',
        "odir": 'gs://path/',
        "ofile": 'current',
        "jobname": 'my-job'
    },
    dataflow_default_options={
        "project": 'my-project',
        "staging_location": 'gs://path/Staging/',
        "temp_location": 'gs://path/Temp/',
    },
    dag=dag
)
With the current script (although I'm not sure its format is correct), this is what I get in the logs:
[2020-03-06 05:08:48,070] {base_task_runner.py:115} INFO - Job 810: Subtask my_task [2020-03-06 05:08:48,070] {cli.py:545} INFO - Running <TaskInstance: test-df-po.my_task 2020-02-29T00:00:00+00:00 [running]> on host airflow-worker-69b88ff66d-5wwrn
[2020-03-06 05:08:48,245] {taskinstance.py:1059} ERROR - 'int' object has no attribute '__len__'
Traceback (most recent call last):
  File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 381, in execute
    self.py_file, self.py_options)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 240, in start_python_dataflow
    label_formatter)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 368, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 197, in _start_dataflow
    cmd = command_prefix + self._build_cmd(variables, label_formatter)
  File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 266, in _build_cmd
    elif value is None or value.__len__() < 1:
AttributeError: 'int' object has no attribute '__len__'
[2020-03-06 05:08:48,247] {base_task_runner.py:115} INFO - Job 810: Subtask my_task [2020-03-06 05:08:48,245] {taskinstance.py:1059} ERROR - 'int' object has no attribute '__len__'
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task Traceback (most recent call last):
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task result = task_copy.execute(context=context)
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 381, in execute
[2020-03-06 05:08:48,248] {base_task_runner.py:115} INFO - Job 810: Subtask my_task self.py_file, self.py_options)
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 240, in start_python_dataflow
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task label_formatter)
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 368, in wrapper
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task return func(self, *args, **kwargs)
[2020-03-06 05:08:48,249] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 197, in _start_dataflow
[2020-03-06 05:08:48,250] {base_task_runner.py:115} INFO - Job 810: Subtask my_task cmd = command_prefix + self._build_cmd(variables, label_formatter)
[2020-03-06 05:08:48,250] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 266, in _build_cmd
[2020-03-06 05:08:48,251] {base_task_runner.py:115} INFO - Job 810: Subtask my_task elif value is None or value.__len__() < 1:
[2020-03-06 05:08:48,251] {taskinstance.py:1082} INFO - Marking task as UP_FOR_RETRY
[2020-03-06 05:08:48,253] {base_task_runner.py:115} INFO - Job 810: Subtask my_task AttributeError: 'int' object has no attribute '__len__'
[2020-03-06 05:08:48,254] {base_task_runner.py:115} INFO - Job 810: Subtask my_task [2020-03-06 05:08:48,251] {taskinstance.py:1082} INFO - Marking task as UP_FOR_RETRY
[2020-03-06 05:08:48,331] {base_task_runner.py:115} INFO - Job 810: Subtask my_task Traceback (most recent call last):
[2020-03-06 05:08:48,332] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/bin/airflow", line 7, in <module>
[2020-03-06 05:08:48,334] {base_task_runner.py:115} INFO - Job 810: Subtask my_task exec(compile(f.read(), __file__, 'exec'))
[2020-03-06 05:08:48,334] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/bin/airflow", line 37, in <module>
[2020-03-06 05:08:48,334] {base_task_runner.py:115} INFO - Job 810: Subtask my_task args.func(args)
[2020-03-06 05:08:48,335] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/utils/cli.py", line 74, in wrapper
[2020-03-06 05:08:48,336] {base_task_runner.py:115} INFO - Job 810: Subtask my_task return f(*args, **kwargs)
[2020-03-06 05:08:48,336] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/bin/cli.py", line 551, in run
[2020-03-06 05:08:48,337] {base_task_runner.py:115} INFO - Job 810: Subtask my_task _run(args, dag, ti)
[2020-03-06 05:08:48,338] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/bin/cli.py", line 469, in _run
[2020-03-06 05:08:48,338] {base_task_runner.py:115} INFO - Job 810: Subtask my_task pool=args.pool,
[2020-03-06 05:08:48,339] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/utils/db.py", line 74, in wrapper
[2020-03-06 05:08:48,340] {base_task_runner.py:115} INFO - Job 810: Subtask my_task return func(*args, **kwargs)
[2020-03-06 05:08:48,341] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
[2020-03-06 05:08:48,342] {base_task_runner.py:115} INFO - Job 810: Subtask my_task result = task_copy.execute(context=context)
[2020-03-06 05:08:48,342] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/operators/dataflow_operator.py", line 381, in execute
[2020-03-06 05:08:48,343] {base_task_runner.py:115} INFO - Job 810: Subtask my_task self.py_file, self.py_options)
[2020-03-06 05:08:48,343] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 240, in start_python_dataflow
[2020-03-06 05:08:48,344] {base_task_runner.py:115} INFO - Job 810: Subtask my_task label_formatter)
[2020-03-06 05:08:48,345] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_api_base_hook.py", line 368, in wrapper
[2020-03-06 05:08:48,345] {base_task_runner.py:115} INFO - Job 810: Subtask my_task return func(self, *args, **kwargs)
[2020-03-06 05:08:48,346] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 197, in _start_dataflow
[2020-03-06 05:08:48,347] {base_task_runner.py:115} INFO - Job 810: Subtask my_task cmd = command_prefix + self._build_cmd(variables, label_formatter)
[2020-03-06 05:08:48,349] {base_task_runner.py:115} INFO - Job 810: Subtask my_task File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataflow_hook.py", line 266, in _build_cmd
[2020-03-06 05:08:48,350] {base_task_runner.py:115} INFO - Job 810: Subtask my_task elif value is None or value.__len__() < 1:
[2020-03-06 05:08:48,350] {base_task_runner.py:115} INFO - Job 810: Subtask my_task AttributeError: 'int' object has no attribute '__len__'
[2020-03-06 05:08:48,638] {helpers.py:308} INFO - Sending Signals.SIGTERM to GPID 8481
[2020-03-06 05:08:48,697] {helpers.py:286} INFO - Process psutil.Process(pid=8481, status='terminated') (8481) terminated with exit code -15
Pass every value in options as a string (note '3' rather than 3):
    options={
        "num-workers": '3',
        "input": 'gs://path/*.txt',
        "odir": 'gs://path/',
        "ofile": 'current'
    },
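If the option values come from elsewhere as numbers, one way to avoid the error is to convert them before handing the dict to the operator. The following is a minimal sketch, not part of Airflow: `stringify_options` is a hypothetical helper name, and it assumes that `None` values and the `labels` entry should be left untouched because the hook treats them specially.

```python
def stringify_options(options):
    """Return a copy of options with scalar values converted to str.

    Hypothetical helper, not an Airflow API. None is kept as-is (the hook
    turns it into a flag without a value) and 'labels' is left alone
    because the hook formats it through a separate label formatter.
    """
    result = {}
    for key, value in options.items():
        if value is None or key == "labels":
            result[key] = value
        else:
            result[key] = str(value)
    return result


options = stringify_options({
    "num-workers": 3,            # int in the original attempt
    "input": "gs://path/*.txt",
    "odir": "gs://path/",
    "ofile": "current",
})
print(options["num-workers"])
```

The resulting dict could then be passed as `options=options` to the operator in place of the literal dict.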
The reason the values must be strings is visible in the hook's _build_cmd (gcp_dataflow_hook.py), which calls value.__len__() and concatenates each value into the command:
    @staticmethod
    def _build_cmd(variables, label_formatter):
        command = ["--runner=DataflowRunner"]
        if variables is not None:
            for attr, value in variables.items():
                if attr == 'labels':
                    command += label_formatter(value)
                elif value is None or value.__len__() < 1:
                    command.append("--" + attr)
                else:
                    command.append("--" + attr + "=" + value)
        return command
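To see the behavior without running a DAG, the logic above can be copied into a plain function and called directly. This is only a sketch for experimentation: `build_cmd` and `try_build` are local stand-ins written for this example, not Airflow functions, and the `labels` branch is not exercised here.

```python
def build_cmd(variables, label_formatter=None):
    """Stand-alone copy of the hook logic quoted above, for experimentation."""
    command = ["--runner=DataflowRunner"]
    if variables is not None:
        for attr, value in variables.items():
            if attr == "labels":
                command += label_formatter(value)
            elif value is None or value.__len__() < 1:
                # None or empty value: emit the flag on its own
                command.append("--" + attr)
            else:
                # str + str concatenation; requires value to be a string
                command.append("--" + attr + "=" + value)
    return command


def try_build(variables):
    """Return the built command, or the error text if building fails."""
    try:
        return build_cmd(variables)
    except AttributeError as exc:
        return "AttributeError: " + str(exc)


failing = try_build({"num-workers": 3})       # int value, as in the question
working = try_build({"num-workers": "3", "input": "gs://path/*.txt"})
print(failing)
print(working)
```

With an int the call fails on `value.__len__()` before any command is built, while string values are appended as `--name=value` pairs.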