Python 3.x: disabling job attachment in Dataproc when using Airflow's DataProcJobHook
Tags: python-3.x, google-cloud-platform, airflow

I'm using Airflow to run jobs in GCP Dataproc. Before each job execution, the hook checks whether the new job can be attached to a previously executed job. When a job is attached, Dataproc does not execute it again unless I delete the previous (attached) job. Because of this flow I lose a lot of metadata and logs. Is there any way to disable the attaching? This is the code in the hook that does the attachment:
    # There is a small set of states that we will accept as sufficient
    # for attaching the new task instance to the old Dataproc job. We
    # generally err on the side of _not_ attaching, unless the prior
    # job is in a known-good state. For example, we don't attach to an
    # ERRORed job because we want Airflow to be able to retry the job.
    # The full set of possible states is here:
    # https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/projects.regions.jobs#State
    recoverable_states = frozenset([
        'PENDING',
        'SETUP_DONE',
        'RUNNING',
        'DONE',
    ])
    found_match = False
    for job_on_cluster in jobs_on_cluster:
        job_on_cluster_id = job_on_cluster['reference']['jobId']
        job_on_cluster_task_id = job_on_cluster_id[:-UUID_LENGTH]
        if task_id_to_submit == job_on_cluster_task_id:
            self.job = job_on_cluster
            self.job_id = self.job['reference']['jobId']
            found_match = True
            # We can stop looking once we find a matching job in a recoverable state.
            if self.job['status']['state'] in recoverable_states:
                break
    if found_match and self.job['status']['state'] in recoverable_states:
        message = """
        Reattaching to previously-started DataProc job %s (in state %s).
        If this is not the desired behavior (ie if you would like to re-run this job),
        please delete the previous instance of the job by running:
        gcloud --project %s dataproc jobs delete %s --region %s
        """
This code has been rearranged on the master branch and I couldn't tell whether the feature still exists. When submitting a job, the Dataproc operator appends a random UUID to the job ID, so I don't see how the code could end up submitting the same job ID with the old UUID, but apparently it does. I worked around it by copying the hook class and commenting out 'DONE' in the recoverable_states list.
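The effect of that workaround can be sketched with a standalone version of the hook's attach check. This is not the Airflow hook itself, just an illustrative stand-in: the job-ID suffix length and the sample jobs are assumptions, and 'DONE' has been dropped from the recoverable states so a finished job is never reattached.

```python
# Illustrative stand-in for the hook's attach check, with 'DONE'
# removed from the recoverable states. All names here are hypothetical.
RECOVERABLE_STATES = frozenset(['PENDING', 'SETUP_DONE', 'RUNNING'])  # 'DONE' dropped
UUID_LENGTH = 9  # assumed length of the "_<uuid>" suffix appended to job IDs

def find_attachable_job(jobs_on_cluster, task_id_to_submit):
    """Return a prior job to reattach to, or None to force a new submit."""
    for job in jobs_on_cluster:
        job_id = job['reference']['jobId']
        # Strip the random suffix to recover the task ID the job was submitted under.
        if task_id_to_submit == job_id[:-UUID_LENGTH]:
            if job['status']['state'] in RECOVERABLE_STATES:
                return job
    return None
```

With 'DONE' removed, a completed job with a matching task ID no longer blocks resubmission, while an in-flight job is still picked up.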
Cheers. I don't think there is any option you can set to disable it, and I hit the same problem when migrating my DAGs to Airflow 1.10.11. I was creating multiple jobs with the Dataproc hook within the same Airflow task, and I solved it by setting a unique job name every time, so that the check

    if task_id_to_submit == job_on_cluster_task_id:

always evaluates to False and a new job is created:
    def build_hive_job(self, job_type, **kwargs):
        job = self.hook.create_job_template(task_id="HiveJob",
                                            cluster_name=self.cluster,
                                            job_type="hiveJob",
                                            properties=None)
        if job_type == "hql":
            job.set_job_name(f"HiveTablesCreationJob-{uuid.uuid4()}")
            job.add_query_uri(kwargs['hql'])
            job.add_variables(kwargs['variables'])
        elif job_type == "query":
            job.set_job_name(f"HiveDatabaseCreationJob-{uuid.uuid4()}")
            job.add_query(kwargs['queries'])
        else:
            raise ValueError("Job type {} is not valid".format(job_type))
        return job.build()
I solved this by using DataprocSubmitJobOperator:
Hello, are you running Airflow inside Cloud Composer?
Hi @muscat, no, I'm running Airflow locally and connecting to GCP.
Have you considered using a new job_id when submitting the job?
    from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
    ...

    with DAG('your_dag.py',
             schedule_interval='0 * * * *',
             default_args=DEFAULT_DAG_ARGS) as dag:

        PYSPARK_JOB = {
            "reference": {"project_id": project},
            "placement": {"cluster_name": cluster},
            "pyspark_job": {"main_python_file_uri": 'gs://your_bucket/your_app.py'},
        }

        pyspark_task = DataprocSubmitJobOperator(
            task_id="pyspark_task",
            job=PYSPARK_JOB,
            location='your_region',
            project_id=project,
            dag=dag)
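When `reference.job_id` is omitted, as above, an ID is generated for you. If you want every run to carry an explicitly unique ID, you can also set `job_id` in the `reference` block yourself (the field names follow the Dataproc Job resource; the helper below is a hypothetical sketch, not part of the operator's API):

```python
import uuid

def pyspark_job_spec(project, cluster, main_uri):
    # Setting reference.job_id explicitly with a fresh UUID suffix means
    # no earlier Dataproc job can ever share this ID, so nothing is reused.
    return {
        "reference": {
            "project_id": project,
            "job_id": f"pyspark-task-{uuid.uuid4().hex}",
        },
        "placement": {"cluster_name": cluster},
        "pyspark_job": {"main_python_file_uri": main_uri},
    }
```

The resulting dict can be passed as the `job=` argument in place of `PYSPARK_JOB`.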