Python 3.x: Disable attaching to a previous Dataproc job when using Airflow's _DataProcJob hook


I'm using Airflow to run jobs in GCP Dataproc. Before each job is executed, the hook checks whether the job can be attached to a previously executed one.

When a job gets attached, Dataproc does not execute it unless I delete the previous (attached) job first.

Because of this flow I lose a lot of metadata and logs.

Is there any way to disable this attaching behavior?

This is the code in the hook that does the attaching:

        # There is a small set of states that we will accept as sufficient
        # for attaching the new task instance to the old Dataproc job.  We
        # generally err on the side of _not_ attaching, unless the prior
        # job is in a known-good state. For example, we don't attach to an
        # ERRORed job because we want Airflow to be able to retry the job.
        # The full set of possible states is here:
        # https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.jobs#State
        recoverable_states = frozenset([
            'PENDING',
            'SETUP_DONE',
            'RUNNING',
            'DONE',
        ])

        found_match = False
        for job_on_cluster in jobs_on_cluster:
            job_on_cluster_id = job_on_cluster['reference']['jobId']
            job_on_cluster_task_id = job_on_cluster_id[:-UUID_LENGTH]
            if task_id_to_submit == job_on_cluster_task_id:

                self.job = job_on_cluster
                self.job_id = self.job['reference']['jobId']
                found_match = True

                # We can stop looking once we find a matching job in a recoverable state.
                if self.job['status']['state'] in recoverable_states:
                    break

        if found_match and self.job['status']['state'] in recoverable_states:
            message = """
    Reattaching to previously-started DataProc job %s (in state %s).
    If this is not the desired behavior (ie if you would like to re-run this job),
    please delete the previous instance of the job by running:

    gcloud --project %s dataproc jobs delete %s --region %s
"""

This code has been rearranged on the master branch, and I can't tell whether this behavior still exists there.

When a job is submitted, the Dataproc operator appends a random UUID to the job ID, so I don't see how the code could end up submitting the same job ID with the old UUID, but apparently it does. I worked around it by copying the hook class and commenting out 'DONE' from the recoverable_states list, as sketched below.

Cheers.
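
For reference, a minimal sketch of that modification, assuming you vendor the Airflow 1.10.x contrib hook module into your own plugins package (module layout and variable names follow that contrib version and may differ in yours):

        # Same structure as the excerpt above, but with 'DONE' removed so a
        # completed job is never treated as re-attachable; a fresh job is
        # submitted (and can be retried by Airflow) instead.
        recoverable_states = frozenset([
            'PENDING',
            'SETUP_DONE',
            'RUNNING',
            # 'DONE',  # commented out in the vendored copy of the hook
        ])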

I don't think there is any option you can set to disable it, and I ran into the same problem when migrating my DAGs to Airflow 1.10.11 and creating multiple jobs with the Dataproc hook inside the same Airflow task. I worked around it by setting a unique job name each time, which makes the check

    if task_id_to_submit == job_on_cluster_task_id:

always evaluate to False, so a new job is created:

    # Note: requires `import uuid` at module level.
    def build_hive_job(self, job_type, **kwargs):
        job = self.hook.create_job_template(task_id="HiveJob",
                                            cluster_name=self.cluster,
                                            job_type="hiveJob",
                                            properties=None)
        if job_type == "hql":
            # Unique job name per call, so the hook never re-attaches to an old job.
            job.set_job_name(f"HiveTablesCreationJob-{str(uuid.uuid4())}")
            job.add_query_uri(kwargs['hql'])
            job.add_variables(kwargs['variables'])
        elif job_type == "query":
            job.set_job_name(f"HiveDatabaseCreationJob-{str(uuid.uuid4())}")
            job.add_query(kwargs['queries'])
        else:
            raise ValueError("Job type {} is not valid".format(job_type))

        return job.build()
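
A hypothetical call site for the method above, assuming self.hook is an airflow.contrib.hooks.gcp_dataproc_hook.DataProcHook from Airflow 1.10.x (whose submit method takes project_id, job and region); the script URI, variables, project and region are placeholders:

    # Hypothetical usage (not from the original answer): build the job and submit
    # it through the contrib DataProcHook. Because set_job_name() gave the job a
    # unique name, the hook's attachment check never matches an earlier job.
    job = self.build_hive_job(
        "hql",
        hql="gs://my-bucket/create_tables.hql",   # placeholder script URI
        variables={"db": "analytics"},            # placeholder Hive variables
    )
    self.hook.submit(project_id="my-project", job=job, region="us-central1")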

Hi, are you running Airflow inside Cloud Composer?

Hi @muscat, no, I'm running Airflow locally and connecting to GCP.

Have you considered using a new job_id when submitting the job?

I solved this problem by using DataprocSubmitJobOperator:
    from airflow import DAG
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

    # ... (DEFAULT_DAG_ARGS, project and cluster definitions omitted in the original)

    with DAG('your_dag.py',
             schedule_interval='0 * * * *',
             default_args=DEFAULT_DAG_ARGS
             ) as dag:

        PYSPARK_JOB = {
            "reference": {"project_id": project},
            "placement": {"cluster_name": cluster},
            "pyspark_job": {"main_python_file_uri": 'gs://your_bucket/your_app.py'},
        }

        pyspark_task = DataprocSubmitJobOperator(
            task_id="pyspark_task",
            job=PYSPARK_JOB,
            location='your_region',
            project_id=project,
            dag=dag)
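
Following up on the comment about using a new job_id: as an optional variation (my assumption, not part of the original answer), you can also set an explicit, unique job_id inside the job's "reference", since the Dataproc Job resource accepts a job_id field there. For example:

    import uuid

    # Variation on PYSPARK_JOB above (an assumption, not from the original answer):
    # pin an explicit, per-submission job_id so every run creates a distinct
    # Dataproc job. Field names stay in snake_case, as in the dict above.
    PYSPARK_JOB_WITH_ID = {
        "reference": {
            "project_id": project,
            "job_id": f"pyspark_task_{uuid.uuid4().hex[:8]}",
        },
        "placement": {"cluster_name": cluster},
        "pyspark_job": {"main_python_file_uri": 'gs://your_bucket/your_app.py'},
    }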