Python 3.x: disabling job attachment in Dataproc when using Airflow's DataProcJobHook
Tags: python-3.x, google-cloud-platform, airflow

I'm using Airflow to run jobs in GCP Dataproc. Before each job execution, the hook checks whether the new job can be attached to a previously executed job. When a job is attached, Dataproc does not execute it again unless I delete the previous (attached) job. Because of this flow I lose a lot of metadata and logs. Is there any way to disable the attaching? This is the code in the hook that does the attachment:
    # There is a small set of states that we will accept as sufficient
    # for attaching the new task instance to the old Dataproc job. We
    # generally err on the side of _not_ attaching, unless the prior
    # job is in a known-good state. For example, we don't attach to an
    # ERRORed job because we want Airflow to be able to retry the job.
    # The full set of possible states is here:
    # https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.jobs#State
    recoverable_states = frozenset([
        'PENDING',
        'SETUP_DONE',
        'RUNNING',
        'DONE',
    ])
    found_match = False
    for job_on_cluster in jobs_on_cluster:
        job_on_cluster_id = job_on_cluster['reference']['jobId']
        job_on_cluster_task_id = job_on_cluster_id[:-UUID_LENGTH]
        if task_id_to_submit == job_on_cluster_task_id:
            self.job = job_on_cluster
            self.job_id = self.job['reference']['jobId']
            found_match = True
            # We can stop looking once we find a matching job in a recoverable state.
            if self.job['status']['state'] in recoverable_states:
                break
    if found_match and self.job['status']['state'] in recoverable_states:
        message = """
        Reattaching to previously-started DataProc job %s (in state %s).
        If this is not the desired behavior (ie if you would like to re-run this job),
        please delete the previous instance of the job by running:
        gcloud --project %s dataproc jobs delete %s --region %s
        """
This code has been rearranged on the master branch and I couldn't tell whether the feature still exists. When submitting a job, the Dataproc operator appends a random UUID to the job ID, so I don't see how the code could end up submitting the same job ID with the old UUID, but apparently it does. I worked around it by copying the hook class and commenting out 'DONE' in the recoverable_states list.
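The effect of that workaround can be sketched with a standalone version of the hook's attach check. This is not the Airflow hook itself, just an illustrative stand-in: the job-ID suffix length and the sample jobs are assumptions, and 'DONE' has been dropped from the recoverable states so a finished job is never reattached.

```python
# Illustrative stand-in for the hook's attach check, with 'DONE'
# removed from the recoverable states. All names here are hypothetical.
RECOVERABLE_STATES = frozenset(['PENDING', 'SETUP_DONE', 'RUNNING'])  # 'DONE' dropped
UUID_LENGTH = 9  # assumed length of the "_<uuid>" suffix appended to job IDs

def find_attachable_job(jobs_on_cluster, task_id_to_submit):
    """Return a prior job to reattach to, or None to force a new submit."""
    for job in jobs_on_cluster:
        job_id = job['reference']['jobId']
        # Strip the random suffix to recover the task ID the job was submitted under.
        if task_id_to_submit == job_id[:-UUID_LENGTH]:
            if job['status']['state'] in RECOVERABLE_STATES:
                return job
    return None
```

With 'DONE' removed, a completed job with a matching task ID no longer blocks resubmission, while an in-flight job is still picked up.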
Cheers. I don't think there is any option you can set to disable it, and I hit the same problem when migrating my DAGs to Airflow 1.10.11. I was creating multiple jobs with the Dataproc hook within the same Airflow task, and I solved it by setting a unique job name every time, so that the check

    if task_id_to_submit == job_on_cluster_task_id:

always evaluates to False and a new job is created:
    def build_hive_job(self, job_type, **kwargs):
        job = self.hook.create_job_template(task_id="HiveJob",
                                            cluster_name=self.cluster,
                                            job_type="hiveJob",
                                            properties=None)
        if job_type == "hql":
            job.set_job_name(f"HiveTablesCreationJob-{uuid.uuid4()}")
            job.add_query_uri(kwargs['hql'])
            job.add_variables(kwargs['variables'])
        elif job_type == "query":
            job.set_job_name(f"HiveDatabaseCreationJob-{uuid.uuid4()}")
            job.add_query(kwargs['queries'])
        else:
            raise ValueError("Job type {} is not valid".format(job_type))
        return job.build()
I solved this by using DataprocSubmitJobOperator:
Hello, are you running Airflow inside Cloud Composer?
Hi @muscat, no, I'm running Airflow locally and connecting to GCP.
Have you considered using a new job_id when submitting the job?
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
    ...

    with DAG('your_dag.py',
             schedule_interval='0 * * * *',
             default_args=DEFAULT_DAG_ARGS) as dag:

        PYSPARK_JOB = {
            "reference": {"project_id": project},
            "placement": {"cluster_name": cluster},
            "pyspark_job": {"main_python_file_uri": 'gs://your_bucket/your_app.py'},
        }

        pyspark_task = DataprocSubmitJobOperator(
            task_id="pyspark_task",
            job=PYSPARK_JOB,
            location='your_region',
            project_id=project,
            dag=dag)
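When `reference.job_id` is omitted, as above, an ID is generated for you. If you want every run to carry an explicitly unique ID, you can also set `job_id` in the `reference` block yourself (the field names follow the Dataproc Job resource; the helper below is a hypothetical sketch, not part of the operator's API):

```python
import uuid

def pyspark_job_spec(project, cluster, main_uri):
    # Setting reference.job_id explicitly with a fresh UUID suffix means
    # no earlier Dataproc job can ever share this ID, so nothing is reused.
    return {
        "reference": {
            "project_id": project,
            "job_id": f"pyspark-task-{uuid.uuid4().hex}",
        },
        "placement": {"cluster_name": cluster},
        "pyspark_job": {"main_python_file_uri": main_uri},
    }
```

The resulting dict can be passed as the `job=` argument in place of `PYSPARK_JOB`.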