Running Airflow tasks in parallel


I'm confused about how running two tasks in parallel works.

Here is my DAG:

import datetime as dt
from airflow import DAG
import os
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.dagrun_operator import TriggerDagRunOperator

scriptAirflow = '/home/alexw/scriptAirflow/'
uploadPath='/apps/man-data/data/to_load/'
receiptPath= '/apps/man-data/data/to_receipt/'

def result():
    if os.listdir(receiptPath):
        for files in os.listdir(receiptPath):
            if files.startswith('MEM') and files.endswith('.csv'):
                print('Launching script for: ' + files)
                return 'mem_script'
            elif files.startswith('FMS') and files.endswith('.csv'):
                return 'fms_script'
    else:
        print('No script to launch')
        return "no_script"

def onlyCsvFiles():
    if(os.listdir(uploadPath)):
        for files in os.listdir(uploadPath):
            if (files.startswith('MEM') or files.startswith('FMS')) and files.endswith('.csv'):
                return 'move_good_file'
            else:
                return 'move_bad_file'
    else:
        pass

default_args = {
    'owner': 'testingA',
    'start_date': dt.datetime(2020, 2, 17),
    'retries': 1,
}

dag = DAG('tryingAirflow', default_args=default_args, description='airflow20',
          schedule_interval=None, catchup=False)

file_sensor = FileSensor(
    task_id="file_sensor",
    filepath=uploadPath,
    fs_conn_id='airflow_db',
    poke_interval=10,
    dag=dag,
)

onlyCsvFiles = BranchPythonOperator(
    task_id='only_csv_files',
    python_callable=onlyCsvFiles,
    trigger_rule='none_failed',
    dag=dag,
)

move_good_file = BashOperator(
    task_id="move_good_file",
    bash_command='python3 '+scriptAirflow+'movingGoodFiles.py "{{ execution_date }}"',
    dag=dag,
)
move_bad_file = BashOperator(
    task_id="move_bad_file",
    bash_command='python3 '+scriptAirflow+'movingBadFiles.py "{{ execution_date }}"',
    dag=dag,
)
result_mv = BranchPythonOperator(
    task_id='result_mv',
    python_callable=result,
    trigger_rule='none_failed',
    dag=dag,
)
run_Mem_Script = BashOperator(
    task_id="mem_script",
    bash_command='python3 '+scriptAirflow+'memShScript.py "{{ execution_date }}"',
    dag=dag,
)
run_Fms_Script = BashOperator(
    task_id="fms_script",
    bash_command='python3 '+scriptAirflow+'fmsScript.py "{{ execution_date }}"',
    dag=dag,
)
skip_script = BashOperator(
    task_id="no_script",
    bash_command="echo No script to launch",
    dag=dag,
)

rerun_dag = TriggerDagRunOperator(
    task_id='rerun_dag',
    trigger_dag_id='tryingAirflow',
    trigger_rule='none_failed',
    dag=dag,
)

onlyCsvFiles.set_upstream(file_sensor)
move_good_file.set_upstream(onlyCsvFiles)
move_bad_file.set_upstream(onlyCsvFiles)
result_mv.set_upstream(move_good_file)
result_mv.set_upstream(move_bad_file)
run_Fms_Script.set_upstream(result_mv)
run_Mem_Script.set_upstream(result_mv)
skip_script.set_upstream(result_mv)
rerun_dag.set_upstream(run_Fms_Script)
rerun_dag.set_upstream(run_Mem_Script)
rerun_dag.set_upstream(skip_script)
When result() chooses a task, even if both tasks should be triggered, only one of them is executed and the other is skipped.

When necessary, I want to execute both tasks at the same time (as allowed by my airflow.cfg). The question is: how can I run tasks in parallel with a BranchPythonOperator (and not run them when it isn't necessary)?
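For context on the airflow.cfg side: parallel task execution requires a non-sequential executor in the first place. A typical fragment, with 1.10-era setting names and illustrative values:

```ini
[core]
# SequentialExecutor (the default with SQLite) runs one task at a time;
# LocalExecutor can run several tasks in parallel on a single machine.
executor = LocalExecutor

# upper bound on concurrently running task instances, installation-wide
parallelism = 32

# upper bound on concurrently running task instances per DAG
dag_concurrency = 16
```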


Thanks for your help!

If you really want to run either both scripts or no script, I would add a dummy task before the two tasks that need to run in parallel. With a BranchPythonOperator, Airflow will always pick exactly one branch to execute.

I would make the following changes:

# import the DummyOperator
from airflow.operators.dummy_operator import DummyOperator

# modify the returns of the function result()
def result():
    if(os.listdir(receiptPath)):
        for files in os.listdir(receiptPath):
            if (files.startswith('MEM') and files.endswith('.csv') or 
                files.startswith('FMS') and files.endswith('.csv')):
                return 'run_scripts'
    else:
        print('No script to launch')
        return "no_script"

# add the dummy task
run_scripts = DummyOperator(
    task_id="run_scripts",
    dag=dag
)

# add dependency
run_scripts.set_upstream(result_mv)

# CHANGE two of the dependencies to
run_Fms_Script.set_upstream(run_scripts)
run_Mem_Script.set_upstream(run_scripts)
I have to admit that I have never used a LocalExecutor with parallel tasks, but this should ensure that both tasks run at the same time whenever the scripts need to run.
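Depending on your Airflow version, there is also a built-in way to follow several branches at once: the branch callable of a BranchPythonOperator may return a *list* of task_ids rather than a single one, and every task in the list is followed. A minimal sketch of such a callable (the helper name `choose_scripts` is mine, not from the question):

```python
import os

def choose_scripts(receipt_path):
    """Branch callable: collect every matching branch instead of
    returning on the first hit, so that both 'mem_script' and
    'fms_script' can be followed in one run when BranchPythonOperator
    accepts a list of task_ids."""
    branches = set()
    for f in os.listdir(receipt_path):
        if f.startswith('MEM') and f.endswith('.csv'):
            branches.add('mem_script')
        elif f.startswith('FMS') and f.endswith('.csv'):
            branches.add('fms_script')
    # fall back to the skip task when nothing matches
    return sorted(branches) or ['no_script']
```

It would be wired up exactly like `result()` above, e.g. `python_callable=lambda: choose_scripts(receiptPath)`.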

EDIT:

If you want to run none, one of the two, or both, I think the easiest way is to create another task that runs both scripts in parallel in bash (or at least runs them together with `&`). I would do it like this:


# modify the returns of the function result() so that it chooses between 4 different outcomes
def result():
    if(os.listdir(receiptPath)):
        mem_flag = False
        fms_flag = False
        for files in os.listdir(receiptPath):
            if (files.startswith('MEM') and files.endswith('.csv')):
                mem_flag = True
            if (files.startswith('FMS') and files.endswith('.csv')):
                fms_flag = True
        if mem_flag and fms_flag:
            return "both_scripts"
        elif mem_flag:
            return "mem_script"
        elif fms_flag:
            return "fms_script"
        else:
            return "no_script"
    else:
        print('No script to launch')
        return "no_script"

# add the 'run both scripts' task
run_both_scripts = BashOperator(
    task_id="both_scripts",
    bash_command='python3 '+scriptAirflow+'memShScript.py "{{ execution_date }}" & python3 '+scriptAirflow+'fmsScript.py "{{ execution_date }}" & wait',
    dag=dag,
)

# add dependency
run_both_scripts.set_upstream(result_mv)   
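One caveat with running the scripts via `&` from a BashOperator: if the command line returns while jobs are still in the background, the task is marked done immediately and the scripts' exit codes are lost. A trailing `wait` blocks until both background jobs have finished. A minimal sketch of the pattern (the sleeps and echoes are placeholders for the real scripts):

```shell
#!/bin/sh
# Start two jobs in the background, then block until both have exited.
# Without `wait`, the shell would return immediately and any failure
# in the background jobs would go unnoticed.
(sleep 0.2; echo "mem done") &
(sleep 0.1; echo "fms done") &
wait
echo "both finished"
```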

Sorry, but I don't think my answer will work as it currently stands, because of what the result() function returns. I can edit my answer, but first I'd like to better understand what you want to achieve: after the branch, do you want to run nothing, both tasks, or only one of the two xxx_script tasks?

mem and fms are launch scripts, so if there are files in the folder, result_mv must launch fms, or mem, or both. But when both kinds of file are there, it only runs fms or mem, not both.

But as I tried to explain, my answer always works by executing both scripts, even if only one of the conditions is met. Is that behavior acceptable?

Indeed, it executes both tasks, while I only want to run the relevant one and skip the other.

Then this will need a more elaborate approach; tonight I'll try to extend my answer. I have now modified the answer to account for running either one of the tasks, both, or neither, by having the bash command run the scripts in the background. So when you run both commands with `&` at the end, they will both run in the background, so essentially they will run "in parallel". For more context, see. Let me know if this works! :)