Creating dynamic tasks with Python in Airflow


I am trying to create a dynamic workflow. This is what I have:

I am trying to create tasks dynamically with a BashOperator (which calls a python script).

My DAG:

import datetime as dt
from airflow import DAG
import shutil
import os
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.dagrun_operator import TriggerDagRunOperator

scriptAirflow = '/home/alexw/scriptAirflow/'
uploadPath='/apps/lv-manuf2020-data/80_DATA/00_Loading/'
receiptPath= '/apps/lv-manuf2020-data/80_DATA/01_Receipt/'
fmsFiles=[]
memFiles=[]

def onlyCsvFiles():
    if(os.listdir(uploadPath)):
        for files in os.listdir(uploadPath):    
            if(files.startswith('MEM') and files.endswith('.csv') or files.startswith('FMS') and files.endswith('.csv')):
                shutil.move(uploadPath+files, receiptPath)
                print(files+' moved in ' + receiptPath+files)
        for files in os.listdir(receiptPath):
            if(files.startswith('MEM') and files.endswith('.csv') or files.startswith('FMS') and files.endswith('.csv')):
                return "run_scripts"
            else:
                return "no_script"
    else:
        print('No file in upload_00')



default_args = {
    'owner': 'manuf2020',
    'start_date': dt.datetime(2020, 2, 17),
    'retries': 1,
}


dag = DAG('lv-manuf2020', default_args=default_args, description='airflow_manuf2020',
          schedule_interval=None, catchup=False)


file_sensor = FileSensor(
    task_id="file_sensor",
    filepath=uploadPath,
    fs_conn_id='airflow_db',
    poke_interval=10,
    dag=dag,
)


move_csv = BranchPythonOperator(
    task_id='move_csv',
    python_callable=onlyCsvFiles,
    trigger_rule='none_failed',
    dag=dag,
)

run_scripts = DummyOperator(
    task_id="run_scripts",
    dag=dag
)

no_script= TriggerDagRunOperator(
    task_id='no_script',
    trigger_dag_id='lv-manuf2020',
    trigger_rule='all_done',
    dag=dag,
)

if os.listdir(receiptPath):
    for files in os.listdir(receiptPath):
        if files.startswith('FMS') and files.endswith('.csv'):
            fmsFiles.append(files)
        if files.startswith('MEM') and files.endswith('.csv'):
            memFiles.append(files)
else:
    pass

for files in fmsFiles:
    run_Fms_Script = BashOperator(
        task_id="fms_script_"+files,
        bash_command='python3 '+scriptAirflow+'fmsScript.py "{{ execution_date }}"',
        dag=dag,
    )
    rerun_dag=TriggerDagRunOperator(
        task_id='rerun_dag',
        trigger_dag_id='lv-manuf2020',
        trigger_rule='none_failed',
        dag=dag,
    )
    run_scripts.set_downstream(run_Fms_Script)
    rerun_dag.set_upstream(run_Fms_Script)

for files in memFiles:
    run_Mem_Script = BashOperator(
        task_id="mem_script_"+files,
        bash_command='python3 '+scriptAirflow+'memShScript.py "{{ execution_date }}"',
        dag=dag,
    )
    rerun_dag=TriggerDagRunOperator(
        task_id='rerun_dag',
        trigger_dag_id='lv-manuf2020',
        trigger_rule='none_failed',
        dag=dag,
    )
    run_scripts.set_downstream(run_Mem_Script)
    rerun_dag.set_upstream(run_Mem_Script)






move_csv.set_upstream(file_sensor)
run_scripts.set_upstream(move_csv)
no_script.set_upstream(move_csv)

It does not work the way I want. In this loop a Python script is called, and that script is supposed to launch a shell script. The tasks are created, but they run immediately and re-run the DAG without the scripts ever being launched:

for files in memFiles:
    run_Mem_Script = BashOperator(
        task_id="mem_script_"+files,
        bash_command='python3 '+scriptAirflow+'memShScript.py "{{ execution_date }}"',
        dag=dag,
    )
    rerun_dag=TriggerDagRunOperator(
        task_id='rerun_dag',
        trigger_dag_id='lv-manuf2020',
        trigger_rule='none_failed',
        dag=dag,
    )
    run_scripts.set_downstream(run_Mem_Script)
    rerun_dag.set_upstream(run_Mem_Script)
Can someone tell me how to create dynamic tasks in parallel, using a BashOperator if necessary (since that is how I call my python scripts)? I need something like this:


file_sensor >> move_csv >> run_scripts >> dynamic tasks >> rerun_dag

All of the code runs only once, when the DAG file is parsed; only the onlyCsvFiles function runs regularly, as part of a task.
Airflow imports the python file, which runs the interpreter and creates a .pyc file next to the DAG's original .py file. Since the code does not change, Airflow will not run the DAG's code again and keeps using the same .pyc file on subsequent imports.

The .pyc file is created by the Python interpreter when the .py file is imported.

To add or change the DAG's tasks, you have to create a process that regularly runs the interpreter and updates the .pyc file.
There are several ways to do this; the best is to let Airflow itself do it.
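To make that distinction concrete, here is a minimal illustrative sketch (a hypothetical stand-alone example, not part of the DAG above; the DAG id and task id are made up): the top-level print runs when the file is interpreted/imported, while the callable runs only when the task instance is executed.

import datetime as dt
from airflow import DAG
from airflow.operators.python_operator import PythonOperator

# Runs when the interpreter imports/parses this file, not per DAG run.
print("top-level code: executed when the file is interpreted")

def at_run_time():
    # Runs only when a worker executes the task instance.
    print("callable: executed on every task run")

demo_dag = DAG('parse_vs_run_demo', start_date=dt.datetime(2020, 2, 17),
               schedule_interval=None, catchup=False)

demo_task = PythonOperator(
    task_id='demo',
    python_callable=at_run_time,
    dag=demo_dag,
)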

I am not going to suggest other ways of creating dynamic tasks here; with this approach, you need to create another task that triggers the interpretation of the python file, "refreshing" the .pyc file with the potentially new tasks as they appear at runtime inside this loop:

for files in memFiles:
    run_Mem_Script = BashOperator(
        task_id="mem_script_"+files,
        bash_command='python3 '+scriptAirflow+'memShScript.py "{{ execution_date }}"',
        dag=dag,
    )
    rerun_dag=TriggerDagRunOperator(
        task_id='rerun_dag',
        trigger_dag_id='lv-manuf2020',
        trigger_rule='none_failed',
        dag=dag,
    )
The python command triggers the interpretation and updates the .pyc file.
Create a standalone task in the DAG for this, editing the bash command with your DAG's absolute path.
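A minimal version of that task, identical to the interpret_python task in the full listing at the end of this answer (/path/to/this/file.py stands for your DAG file's absolute path):

interpret_python = BashOperator(
    task_id="interpret_python",
    bash_command='python3 /path/to/this/file.py',
    dag=dag,
)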

I don't recommend using a python function to look up the current file path, because once the code is imported you may get Airflow's runtime path instead; it might work, though.

Your new code is shown in full at the end of this answer; I only added the interpret_python task to it. Remember to replace /path/to/this/file.py with your DAG file's absolute path.

If you get any runtime errors related to the interpret_python task, try cd-ing into Airflow's base path first (the directory that contains airflow.cfg) and then calling python3 with a relative path.

For example, if Airflow's path is /home/username/airflow and the DAG is at /home/username/airflow/dags/mydag.py, define interpret_python like this:

interpret_python = BashOperator(
    task_id="interpret_python",
    bash_command='cd /home/username/airflow && python3 dags/mydag.py',
    dag=dag,
)  

Is that last code snippet just the rest of the python file?

Yes, it's the rest of my dag file; I only zoomed in on the part where my problem is.

Thanks for your answer. I created interpret_python, and when I trigger the DAG, interpret_python skips all of the following tasks. Should I try deleting the .pyc with a bash command?

The new tasks should be picked up after a few minutes and show up in your Airflow webserver view, and the next DAG run will run them (not the current run, which executes interpret_python and adds them). You can also restart the webserver and the scheduler to speed this up, and don't forget to refresh the webserver page. What is your schedule interval?

Actually, I think my problem is something else: in bash_command='python3 '+scriptAirflow+'memShScript.py', that memShScript.py calls a bash script (with subprocess.call), and my problem is that the bash script is never launched. The Python script itself runs fine, but the bash script inside it never starts.
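As an aside on that last comment: how memShScript.py invokes its bash script is not shown in the question, but a minimal sketch of calling a shell script from Python with subprocess.call might look like the following. The script path is a made-up placeholder; using an absolute path avoids depending on the worker's current working directory.

import subprocess

# Hypothetical sketch: launch a bash script from a Python helper script.
# The path below is a placeholder; the real script name is not given in the question.
return_code = subprocess.call(['/bin/bash', '/home/alexw/scriptAirflow/myScript.sh'])
if return_code != 0:
    raise RuntimeError('bash script exited with code {}'.format(return_code))

For completeness, here is the answer's new DAG code in full: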
import datetime as dt
from airflow import DAG
import shutil
import os
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator, BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.contrib.sensors.file_sensor import FileSensor
from airflow.operators.dagrun_operator import TriggerDagRunOperator

scriptAirflow = '/home/alexw/scriptAirflow/'
uploadPath='/apps/lv-manuf2020-data/80_DATA/00_Loading/'
receiptPath= '/apps/lv-manuf2020-data/80_DATA/01_Receipt/'
fmsFiles=[]
memFiles=[]

def onlyCsvFiles():
    if(os.listdir(uploadPath)):
        for files in os.listdir(uploadPath):    
            if(files.startswith('MEM') and files.endswith('.csv') or files.startswith('FMS') and files.endswith('.csv')):
                shutil.move(uploadPath+files, receiptPath)
                print(files+' moved in ' + receiptPath+files)
        for files in os.listdir(receiptPath):
            if(files.startswith('MEM') and files.endswith('.csv') or files.startswith('FMS') and files.endswith('.csv')):
                return "run_scripts"
            else:
                return "no_script"
    else:
        print('No file in upload_00')



default_args = {
    'owner': 'manuf2020',
    'start_date': dt.datetime(2020, 2, 17),
    'retries': 1,
}


dag = DAG('lv-manuf2020', default_args=default_args, description='airflow_manuf2020',
          schedule_interval=None, catchup=False)


file_sensor = FileSensor(
    task_id="file_sensor",
    filepath=uploadPath,
    fs_conn_id='airflow_db',
    poke_interval=10,
    dag=dag,
)


move_csv = BranchPythonOperator(
    task_id='move_csv',
    python_callable=onlyCsvFiles,
    trigger_rule='none_failed',
    dag=dag,
)

run_scripts = DummyOperator(
    task_id="run_scripts",
    dag=dag
)

no_script= TriggerDagRunOperator(
    task_id='no_script',
    trigger_dag_id='lv-manuf2020',
    trigger_rule='all_done',
    dag=dag,
)

interpret_python = BashOperator(
    task_id="interpret_python",
    bash_command='python3 /path/to/this/file.py',
    dag=dag,
)

if os.listdir(receiptPath):
    for files in os.listdir(receiptPath):
        if files.startswith('FMS') and files.endswith('.csv'):
            fmsFiles.append(files)
        if files.startswith('MEM') and files.endswith('.csv'):
            memFiles.append(files)
else:
    pass

for files in fmsFiles:
    run_Fms_Script = BashOperator(
        task_id="fms_script_"+files,
        bash_command='python3 '+scriptAirflow+'fmsScript.py "{{ execution_date }}"',
        dag=dag,
    )
    rerun_dag=TriggerDagRunOperator(
        task_id='rerun_dag',
        trigger_dag_id='lv-manuf2020',
        trigger_rule='none_failed',
        dag=dag,
    )
    run_scripts.set_downstream(run_Fms_Script)
    rerun_dag.set_upstream(run_Fms_Script)

for files in memFiles:
    run_Mem_Script = BashOperator(
        task_id="mem_script_"+files,
        bash_command='python3 '+scriptAirflow+'memShScript.py "{{ execution_date }}"',
        dag=dag,
    )
    rerun_dag=TriggerDagRunOperator(
        task_id='rerun_dag',
        trigger_dag_id='lv-manuf2020',
        trigger_rule='none_failed',
        dag=dag,
    )
    run_scripts.set_downstream(run_Mem_Script)
    rerun_dag.set_upstream(run_Mem_Script)






move_csv.set_upstream(file_sensor)
run_scripts.set_upstream(move_csv)
no_script.set_upstream(move_csv)  