HTTP callback sensor in Airflow


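Our Airflow implementation sends HTTP requests to get services to run tasks. We want those services to notify us when they finish, so we send each service a callback URL that it calls once the task is done. However, I can't seem to find a callback sensor. How do people usually handle this?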

There is no callback or webhook sensor in Airflow. Sensors are defined as follows, quoting the documentation:

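"Sensors are a certain type of operator that will keep running until a certain criterion is met. Examples include a specific file landing in HDFS or S3, a partition appearing in Hive, or a specific time of the day. Sensors are derived from BaseSensorOperator and run a poke method at a specified poke_interval until it returns True."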
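
The poke loop described in that quote can be sketched in plain Python, without Airflow. This is only an illustration of the semantics, not Airflow's actual implementation; `run_sensor` and `SensorTimeout` are hypothetical names introduced here.

```python
import time


class SensorTimeout(Exception):
    """Raised when the condition is not met in time (hypothetical name)."""


def run_sensor(poke, poke_interval=1.0, timeout=10.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call poke() repeatedly until it returns True.

    poke          -- callable returning True once the criterion is met
    poke_interval -- seconds to wait between two calls to poke()
    timeout       -- give up after this many seconds
    """
    started_at = clock()
    while not poke():
        # The criterion is not met yet: give up if we have waited too long.
        if clock() - started_at > timeout:
            raise SensorTimeout(
                "condition not met after %s seconds" % timeout)
        sleep(poke_interval)
    return True
```

A caller passes any zero-argument function as `poke`, for example one that checks whether a file exists or whether a remote job reports that it has finished.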

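This means that a sensor is an operator that polls an external system. In that sense, your external services need a way of keeping the state of each executed task, either internally or externally, so that a polling sensor can check that state.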

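That way you can use, for example, an HTTP sensor such as HttpSensor, which polls an HTTP endpoint until a condition is met. Or, even better, write your own custom sensor, which gives you the chance to do more complex processing and to keep state.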
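
As a rough sketch of what such a poke could look like, the function below builds a check that fetches a status URL and inspects the JSON reply. Everything about the service is an assumption made for illustration: the status URL, the `state` field, and the state names `done` and `failed` are hypothetical, and `make_status_poke` is not an Airflow API.

```python
import json
import urllib.request


def make_status_poke(status_url, done_states=("done",),
                     failed_states=("failed",),
                     opener=urllib.request.urlopen):
    """Build a poke() that checks a (hypothetical) task status endpoint.

    The endpoint is assumed to answer with JSON like {"state": "running"}.
    opener is injectable so the check can be exercised without a network.
    """
    def poke():
        with opener(status_url) as response:
            body = json.loads(response.read().decode("utf-8"))
        state = body["state"]
        if state in failed_states:
            # Stop polling: the remote task will never succeed.
            raise RuntimeError("task failed with state %r" % state)
        # True ends the polling loop, False requests another cycle.
        return state in done_states
    return poke
```

Inside a custom sensor, the body of `poke` would go into the sensor's own poke method, with the status URL built from whatever identifier the service returned when the task was submitted.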

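Otherwise, if the service writes its output to a storage system, you could use a sensor that polls a database, for example. I believe you get the idea.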
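
One way to combine the callback URL from the question with a polling sensor is to let the callback handler record the reported state in a table and let the sensor poll that table. The sketch below uses SQLite purely for illustration; the table name, column names, state values, and function names are all hypothetical.

```python
import sqlite3


def init_store(conn):
    """Create the (hypothetical) status table if it does not exist yet."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_status ("
        "task_id TEXT PRIMARY KEY, state TEXT NOT NULL)")
    conn.commit()


def record_callback(conn, task_id, state):
    """What a callback endpoint could do when a service reports back."""
    conn.execute(
        "INSERT OR REPLACE INTO task_status (task_id, state) VALUES (?, ?)",
        (task_id, state))
    conn.commit()


def poke(conn, task_id):
    """Return True once the task has been reported as done."""
    row = conn.execute(
        "SELECT state FROM task_status WHERE task_id = ?",
        (task_id,)).fetchone()
    if row is None:
        # No callback received yet: keep polling.
        return False
    if row[0] == "failed":
        raise RuntimeError("task %s reported failure" % task_id)
    return row[0] == "done"
```

The web endpoint that receives the callback lives outside Airflow in this design; Airflow only ever reads the table.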

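I am attaching a custom operator that I wrote to integrate with the Apache Livy API. The operator does two things: a) it submits a Spark job through the REST API, and b) it waits for the job to complete.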

The operator extends SimpleHttpOperator and also implements a poke method in the style of HttpSensor, thus combining both functions.

# Airflow 1.10-style import paths; adjust them for your Airflow version.
import json
from datetime import datetime
from time import sleep

from airflow.exceptions import AirflowException, AirflowSensorTimeout
from airflow.hooks.http_hook import HttpHook
from airflow.operators.http_operator import SimpleHttpOperator
from airflow.utils.decorators import apply_defaults


class LivyBatchOperator(SimpleHttpOperator):
    """
    Submits a new Spark batch job through
    the Apache Livy REST API and waits for it to finish.
    """

    template_fields = ('args',)
    ui_color = '#f4a460'

    @apply_defaults
    def __init__(self,
                 name,
                 className,
                 file,
                 executorMemory='1g',
                 driverMemory='512m',
                 driverCores=1,
                 executorCores=1,
                 numExecutors=1,
                 args=None,
                 conf=None,
                 timeout=120,
                 http_conn_id='apache_livy',
                 *arguments, **kwargs):
        """
        Stores the batch parameters; the request itself is sent in execute().
        """
        super(LivyBatchOperator, self).__init__(
            endpoint='batches', *arguments, **kwargs)

        self.http_conn_id = http_conn_id
        self.method = 'POST'
        self.endpoint = 'batches'
        self.name = name
        self.className = className
        self.file = file
        self.executorMemory = executorMemory
        self.driverMemory = driverMemory
        self.driverCores = driverCores
        self.executorCores = executorCores
        self.numExecutors = numExecutors
        # avoid sharing mutable default arguments between instances
        self.args = args if args is not None else []
        self.conf = conf if conf is not None else {}
        self.timeout = timeout
        self.poke_interval = 10

    def execute(self, context):
        """
        Submits the batch, then polls its status until it finishes
        or the timeout expires.
        """

        payload = {
            "name": self.name,
            "className": self.className,
            "executorMemory": self.executorMemory,
            "driverMemory": self.driverMemory,
            "driverCores": self.driverCores,
            "executorCores": self.executorCores,
            "numExecutors": self.numExecutors,
            "file": self.file,
            "args": self.args,
            "conf": self.conf
        }
        self.log.info("Payload: %s", payload)
        headers = {
            'X-Requested-By': 'airflow',
            'Content-Type': 'application/json'
        }

        http = HttpHook(self.method, http_conn_id=self.http_conn_id)

        self.log.info("Submitting batch through Apache Livy API")

        response = http.run(self.endpoint,
                            json.dumps(payload),
                            headers,
                            self.extra_options)

        # parse the JSON response
        obj = json.loads(response.content)

        # get the new batch Id
        self.batch_id = obj['id']

        self.log.info('Batch successfully submitted with Id %s',
                      self.batch_id)

        # start polling the batch status
        started_at = datetime.utcnow()
        while not self.poke(context):
            if (datetime.utcnow() - started_at).total_seconds() > self.timeout:
                raise AirflowSensorTimeout('Snap. Time is OUT.')

            sleep(self.poke_interval)

        self.log.info("Batch %s has finished", self.batch_id)

    def poke(self, context):
        """
        Checks the batch status. Returns True when the batch has succeeded,
        False while it is still in progress, and raises on any other state.
        """

        http = HttpHook(method='GET', http_conn_id=self.http_conn_id)

        self.log.info("Calling Apache Livy API to get batch status")

        # call the API endpoint
        endpoint = 'batches/' + str(self.batch_id)
        response = http.run(endpoint)

        # parse the JSON response
        obj = json.loads(response.content)

        # get the current state of the batch
        state = obj['state']

        # check the batch state
        if state in ('starting', 'running'):
            # the batch is still in progress:
            # signal a new polling cycle
            self.log.info('Batch %s has not finished yet (%s)',
                          self.batch_id, state)
            return False
        elif state == 'success':
            # the batch has finished successfully
            return True
        else:
            # for all other states
            # raise an exception and
            # terminate the task
            raise AirflowException(
                'Batch ' + str(self.batch_id) + ' failed (' + state + ')')
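
The state check in poke can also be pulled out into a small pure function, which makes it possible to test without Airflow or a Livy server. This is a sketch only: `batch_finished` and `BatchFailed` are hypothetical names, a plain exception stands in for AirflowException, and the set of states is limited to the three that the operator above handles explicitly.

```python
# States the operator above treats as "still in progress".
IN_PROGRESS_STATES = frozenset(["starting", "running"])
SUCCESS_STATE = "success"


class BatchFailed(Exception):
    """Stand-in for AirflowException in this sketch (hypothetical name)."""


def batch_finished(batch_id, state):
    """Map a batch state to a poke result.

    Returns False while the batch is in progress, True on success,
    and raises BatchFailed for any other state.
    """
    if state in IN_PROGRESS_STATES:
        return False
    if state == SUCCESS_STATE:
        return True
    raise BatchFailed("Batch %s failed (%s)" % (batch_id, state))
```

With this split, poke only has to fetch the state over HTTP and pass it to `batch_finished`.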

Hope this helps.

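Thanks for the interesting answer, spilio. I am considering using Livy to integrate Spark jobs with Airflow. I had been wondering about the same problem and found your answer today, but I haven't tried it yet. Could you tell me a bit about your experience with the Airflow/Livy/Spark stack? It seems to me that Livy's development is not moving very fast. Have you run into any major issues? Is Livy stable enough for your needs? To be honest, I am new to Livy; I have been using Airflow for a while, though not with Spark. Any reply from you would help a lot. Thanks.

Hi @Srikanth. So far my experience with Apache Livy has been very smooth, because it wraps job submission in a REST API that is very simple to use. Before using Livy, I had to submit Spark jobs to the cluster with the standard CLI commands, which requires the Spark binaries to be available on the client machine. Adding Livy to the cluster on the Hortonworks Data Platform took just a few clicks in Ambari. In addition, submitting jobs through Livy is asynchronous by nature, which lets you run non-blocking tasks. This is very useful because you can have different kinds of operators waiting for a job to complete: either a submit-and-poll operator (like the one I shared, which does both) or a poll-only operator (which waits for a job to finish and then lets other tasks proceed). This can give your flows new dynamics and decouple things in a very useful way. Hope this helps :)

Thank you for your valuable input, spilio. This really helps in deciding in favor of Airflow + Livy + Spark.

Thanks @spilio. This was really helpful.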