Python: duplicate entry error when scheduling lots of DAGs at the same time


I'm running this test:

from unittest import TestCase

from backend.tasks.airflow import trigger_dag


class TestTriggerDag(TestCase):

    def test_trigger_dag(self):
        trigger_dag("update_game_dag", game_id=99)
        trigger_dag("update_game_dag", game_id=100)
        trigger_dag("update_game_dag", game_id=101)
        trigger_dag("update_game_dag", game_id=102)
        trigger_dag("update_game_dag", game_id=103)
        self.assertTrue(True)
The internals of trigger_dag are:

from typing import List
import random
import time

from airflow.api.client.local_client import Client
from airflow.models.dagrun import DagRun

afc = Client(None, None)

...
def get_dag_run_state(dag_id: str, run_id: str):
    return DagRun.find(dag_id=dag_id, run_id=run_id)[0].state


def trigger_dag(dag_id: str, wait_for_complete: bool = False, **kwargs):
    # a random 30-hex-digit hash keeps run_ids unique across repeated triggers
    run_hash = '%030x' % random.randrange(16**30)
    kwarg_list = [f"{str(k)}:{str(v)}" for k, v in kwargs.items()]
    run_id = f"{run_hash}-{'_'.join(kwarg_list)}"
    afc.trigger_dag(dag_id, run_id=run_id, conf=kwargs)
    # optionally block until the run leaves the "running" state
    while wait_for_complete and get_dag_run_state(dag_id, run_id) == "running":
        time.sleep(1)
    return get_dag_run_state(dag_id, run_id)
This throws the following error:

sqlalchemy.exc.IntegrityError: (pymysql.err.IntegrityError) (1062, "Duplicate entry 'update_game_dag-2020-08-30 00:30:13.000000' for key 'dag_run.dag_id'")
[SQL: INSERT INTO dag_run (dag_id, execution_date, start_date, end_date, state, run_id, external_trigger, conf) VALUES (%(dag_id)s, %(execution_date)s, %(start_date)s, %(end_date)s, %(state)s, %(run_id)s, %(external_trigger)s, %(conf)s)]
[parameters: {'dag_id': 'update_game_dag', 'execution_date': datetime.datetime(2020, 8, 30, 0, 30, 13), 'start_date': datetime.datetime(2020, 8, 30, 0, 30, 13, 262676), 'end_date': None, 'state': 'running', 'run_id': '3129c0272d7e3e5f018d04d2debf06-game_id:101', 'external_trigger': 1, 'conf': b'\x80\x04\x95\x10\x00\x00\x00\x00\x00\x00\x00}\x94\x8c\x07game_id\x94Kes.'}]
The problem seems to be that when the DAG run is logged to the dag_run metadata table, the execution_date column timestamp is saved at second resolution (datetime.datetime(2020, 8, 30, 0, 30, 13)) rather than at microsecond resolution. When a batch of DAGs is kicked off at the same time, that produces duplicate entry collisions. Interestingly, start_date doesn't behave this way: it keeps the microsecond information (datetime.datetime(2020, 8, 30, 0, 30, 13, 262676)).
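A minimal sketch of the collision (the first timestamp is copied from the error parameters above; the second is invented to stand in for another trigger landing in the same second):

from datetime import datetime

t1 = datetime(2020, 8, 30, 0, 30, 13, 262676)  # start_date from the error above
t2 = datetime(2020, 8, 30, 0, 30, 13, 871002)  # hypothetical second trigger

assert t1 != t2  # distinct at microsecond resolution
assert t1.replace(microsecond=0) == t2.replace(microsecond=0)  # identical once truncated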

Is this a feature or a bug? Is Airflow not supposed to allow more than one externally triggered DAG run for a given DAG id within the same second? Is there a quick fix, or should I send a PR or open an ASF Jira ticket?


In case it's relevant, I want to do this because I have a bunch of assets that need to be updated at the game level on a 5-minute cadence for all the users on our app. We use Celery beat as our application scheduler, not Airflow. Where Airflow really earns its keep is coordinating the execution of task graphs on an elastically scaling worker cluster. So every 5 minutes I want to say "hey Airflow, please trigger the DAG for these 200 games." The DAGs get their game-id awareness from the conf data we pass into the DAG context as well as from external API calls.
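For concreteness, a hypothetical sketch of that beat schedule, assuming a get_active_game_ids() helper that is not part of the question (the stub below stands in for the real lookup):

from celery import Celery

from backend.tasks.airflow import trigger_dag

app = Celery("backend")


def get_active_game_ids():
    """Stub standing in for the real lookup of the ~200 active games."""
    return range(200)


@app.task(name="update_all_games")
def update_all_games():
    # fan out one externally triggered DAG run per game
    for game_id in get_active_game_ids():
        trigger_dag("update_game_dag", game_id=game_id)


app.conf.beat_schedule = {
    "update-games-every-5-min": {
        "task": "update_all_games",
        "schedule": 300.0,  # seconds
    },
}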

Kept at this for a few more minutes and found the solution. The local_client behind from airflow.api.client.local_client import Client simply wraps a base class from airflow.api.client with a couple of methods from airflow.api.common.experimental:

from airflow.api.client import api_client
from airflow.api.common.experimental import pool
from airflow.api.common.experimental import trigger_dag
from airflow.api.common.experimental import delete_dag


class Client(api_client.Client):
    """Local API client implementation."""

    def trigger_dag(self, dag_id, run_id=None, conf=None, execution_date=None):
        dag_run = trigger_dag.trigger_dag(dag_id=dag_id,
                                          run_id=run_id,
                                          conf=conf,
                                          execution_date=execution_date)
        return "Created {}".format(dag_run)

    def delete_dag(self, dag_id):
        count = delete_dag.delete_dag(dag_id)
        return "Removed {} record(s)".format(count)

    def get_pool(self, name):
        the_pool = pool.get_pool(name=name)
        return the_pool.pool, the_pool.slots, the_pool.description

    def get_pools(self):
        return [(p.pool, p.slots, p.description) for p in pool.get_pools()]

    def create_pool(self, name, slots, description):
        the_pool = pool.create_pool(name=name, slots=slots, description=description)
        return the_pool.pool, the_pool.slots, the_pool.description

    def delete_pool(self, name):
        the_pool = pool.delete_pool(name=name)
        return the_pool.pool, the_pool.slots, the_pool.description
It's a curious approach, since none of the class methods here ever actually call into the api_client.Client base class. The trigger_dag from airflow.api.common.experimental takes an argument, replace_microseconds. That's where the information was being dropped.
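Paraphrasing the relevant logic from airflow/api/common/experimental/trigger_dag.py in Airflow 1.10 (approximate, not a verbatim copy) shows where the truncation happens:

from airflow.utils import timezone


def _trigger_dag_excerpt(execution_date=None, replace_microseconds=True):
    # with the default replace_microseconds=True, two triggers in the same
    # second end up with identical execution_dates
    execution_date = execution_date or timezone.utcnow()
    if replace_microseconds:
        execution_date = execution_date.replace(microsecond=0)
    return execution_date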

Calling airflow.api.common.experimental.trigger_dag directly with replace_microseconds=False solved my problem:

from typing import List
import random
import time

from airflow.api.common.experimental import trigger_dag
from airflow.models.dagrun import DagRun


def log_headline(keys: tuple, values: List):
    headline_ls = [f"{key} = {value}" for key, value in zip(keys, values)]
    print("\n \n*** ARGUMENTS ***\n-----------------\n" + ", ".join(headline_ls) + "\n-----------------\n")


def context_parser(context: dict, *args: str):
    """*args looks for an inventory of names from the context that we expect a given task to have access to. Use of
    the .get access method means that misses names will default to None rather than generate a key error"""
    return_values = [context['dag_run'].conf.get(arg) for arg in args]
    log_headline(args, return_values)
    return return_values


def get_dag_run_state(dag_id: str, run_id: str):
    return DagRun.find(dag_id=dag_id, run_id=run_id)[0].state


def start_dag(dag_id: str, wait_for_complete: bool = False, **kwargs):
    # a random 30-hex-digit hash keeps run_ids unique across repeated triggers
    run_hash = '%030x' % random.randrange(16**30)
    kwarg_list = [f"{str(k)}:{str(v)}" for k, v in kwargs.items()]
    run_id = f"{run_hash}-{'_'.join(kwarg_list)}"
    # replace_microseconds=False preserves sub-second resolution on execution_date
    trigger_dag.trigger_dag(dag_id, run_id=run_id, conf=kwargs, replace_microseconds=False)
    # optionally block until the run leaves the "running" state
    while wait_for_complete and get_dag_run_state(dag_id, run_id) == "running":
        time.sleep(1)
    return get_dag_run_state(dag_id, run_id)
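With replace_microseconds=False in place, the five back-to-back triggers from the test at the top no longer collide:

for game_id in (99, 100, 101, 102, 103):
    start_dag("update_game_dag", game_id=game_id)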