Airflow Composer workflow fails at the Dataproc operator

Tags: airflow, google-cloud-dataproc, google-cloud-composer

I have set up a Composer environment in GCP that runs the following DAG:

from airflow import DAG
from airflow.contrib.operators.dataproc_operator import DataProcPySparkOperator
from airflow.operators.bash_operator import BashOperator

with DAG('sample-dataproc-dag',
         default_args=DEFAULT_DAG_ARGS,
         schedule_interval=None) as dag:  # Here we are using dag as context

    # Submit the PySpark job.
    submit_pyspark = DataProcPySparkOperator(
        task_id='run_dataproc_pyspark',
        main='gs://.../dataprocjob.py',
        cluster_name='xyz',
        dataproc_pyspark_jars='gs://.../spark-bigquery-latest_2.12.jar')

    simple_bash = BashOperator(
        task_id='simple-bash',
        bash_command="ls -la")

    submit_pyspark.set_upstream(simple_bash)
This is my dataprocjob.py:

from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.appName('Jupyter BigQuery Storage').getOrCreate()
    table = "projct.dataset.txn_w_ah_demo"
    df = spark.read.format("bigquery").option("table", table).load()
    df.printSchema()
My Composer pipeline fails at the Dataproc step. This is what I see in the Composer logs stored in GCS:

[2020-09-23 21:40:02,849] {taskinstance.py:1059} ERROR - <HttpError 403 when requesting https://dataproc.googleapis.com/v1beta2/projects/lt-dia-pop-dis-upr/regions/global/jobs?clusterName=dppoppr004&alt=json returned "Not authorized to requested resource.">@-@{"workflow": "sample-dataproc-dag", "task-id": "run_dataproc_pyspark", "execution-date": "2020-09-23T21:39:42.371933+00:00"}
Traceback (most recent call last):
File "/usr/local/lib/airflow/airflow/models/taskinstance.py", line 930, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/airflow/airflow/contrib/operators/dataproc_operator.py", line 1139, in execute
super(DataProcPySparkOperator, self).execute(context)
File "/usr/local/lib/airflow/airflow/contrib/operators/dataproc_operator.py", line 707, in execute
self.hook.submit(self.hook.project_id, self.job, self.region, self.job_error_states)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataproc_hook.py", line 311, in submit
num_retries=self.num_retries)
File "/usr/local/lib/airflow/airflow/contrib/hooks/gcp_dataproc_hook.py", line 51, in __init__
clusterName=cluster_name).execute()
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/_helpers.py", line 130, in positional_wrapper
return wrapped(*args, **kwargs)
File "/opt/python3.6/lib/python3.6/site-packages/googleapiclient/http.py", line 851, in execute
raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://dataproc.googleapis.com/v1beta2/projects/lt-dia-pop-dis-upr/regions/global/jobs?clusterName=dppoppr004&alt=json returned "Not authorized to requested resource.">

On a first reading, it looks like the Google Cloud account the operator uses to call the Dataproc API does not have suitable permissions.

The issue you describe appears to correspond to the Dataproc permissions granted to your application.

According to the Dataproc IAM documentation, you need different permissions to perform different Dataproc operations, for example:

dataproc.clusters.create permits the creation of Cloud Dataproc clusters in the containing project
dataproc.jobs.create permits the submission of Dataproc jobs to Dataproc clusters in the containing project
dataproc.clusters.list permits the listing of details of Dataproc clusters in the containing project
If you want to submit a Dataproc job, you need the 'dataproc.clusters.use' and 'dataproc.jobs.create' permissions; one way to grant a role containing them is sketched below.
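As a minimal sketch of how such a grant could be scripted, the snippet below adds a Dataproc role binding to the project's IAM policy through the Cloud Resource Manager API, using the same googleapiclient library that appears in your traceback. The project ID, service account and role here are placeholders, not values taken from your environment:

from googleapiclient import discovery

# Placeholder values -- substitute your own project and Composer service account.
PROJECT_ID = 'my-project-id'
MEMBER = 'serviceAccount:my-composer-sa@my-project-id.iam.gserviceaccount.com'
ROLE = 'roles/dataproc.editor'  # includes dataproc.clusters.use and dataproc.jobs.create

crm = discovery.build('cloudresourcemanager', 'v1')

# Read the current IAM policy, add the binding if it is missing, and write it back.
policy = crm.projects().getIamPolicy(resource=PROJECT_ID, body={}).execute()
binding = next((b for b in policy.get('bindings', []) if b['role'] == ROLE), None)
if binding is None:
    policy.setdefault('bindings', []).append({'role': ROLE, 'members': [MEMBER]})
elif MEMBER not in binding['members']:
    binding['members'].append(MEMBER)
crm.projects().setIamPolicy(resource=PROJECT_ID, body={'policy': policy}).execute()

The same binding can of course be added by hand on the IAM page of the Google Console.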


To grant the correct permissions, you can update the service account used in your code and attach the right roles to it.

How do you grant the account sufficient permissions? This is done through the credentials in the Google Console, and it depends on how those credentials were created; the sketch below shows one way to look up which service account your Composer environment actually runs as.
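If you are not sure which service account that is, the Cloud Composer API exposes it on the environment resource. A hedged sketch, assuming a hypothetical environment name and the v1 API, where the account is reported under config.nodeConfig.serviceAccount:

from googleapiclient import discovery

# Placeholder -- substitute your own project, location and environment name.
ENV = 'projects/my-project-id/locations/us-central1/environments/my-composer-env'

composer = discovery.build('composer', 'v1')
env = composer.projects().locations().environments().get(name=ENV).execute()

# This is the identity that needs the Dataproc permissions listed above.
print(env['config']['nodeConfig']['serviceAccount'])

Whatever that prints is the account that must carry the Dataproc roles described above.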