Airflow: Error using Airflow's DataflowPythonOperator to schedule a Dataflow job


I am trying to schedule a Dataflow job using Airflow's DataflowPythonOperator. Here is my DAG operator:

test = DataFlowPythonOperator(
    task_id = 'my_task',
    py_file = 'path/my_pyfile.py',
    gcp_conn_id='my_conn_id',
    dataflow_default_options={
        "project": 'my_project',
        "runner": "DataflowRunner",
        "job_name": 'my_job',
        "staging_location": 'gs://my/staging', 
        "temp_location": 'gs://my/temping',
        "requirements_file": 'path/requirements.txt'
    }
)
gcp_conn_id is set up and works correctly. The error only says that Dataflow failed with return code 1. The full log is shown below:

[2018-07-05 18:24:39,928] {gcp_dataflow_hook.py:108} INFO - Start waiting for DataFlow process to complete.
[2018-07-05 18:24:40,049] {base_task_runner.py:95} INFO - Subtask: 
[2018-07-05 18:24:40,049] {models.py:1433} ERROR - DataFlow failed with return code 1
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: Traceback (most recent call last):
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/models.py", line 1390, in run
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: result = task_copy.execute(context=context)
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/operators/dataflow_operator.py", line 182, in execute
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: self.py_file, self.py_options)
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 152, in start_python_dataflow
[2018-07-05 18:24:40,050] {base_task_runner.py:95} INFO - Subtask: task_id, variables, dataflow, name, ["python"] + py_options)
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 138, in _start_dataflow
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: _Dataflow(cmd).wait_for_done()
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/hooks/gcp_dataflow_hook.py", line 119, in wait_for_done
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: self._proc.returncode))
[2018-07-05 18:24:40,051] {base_task_runner.py:95} INFO - Subtask: Exception: DataFlow failed with return code 1

Something seems to be wrong in gcp_dataflow_hook.py, and beyond that there is no more information. Is there a way to fix this? Are there any examples of DataflowPythonOperator? (I haven't found any use cases so far.)
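
For reference, a minimal sketch of how DataFlowPythonOperator might be wired into a complete DAG file on Airflow 1.x follows; the DAG id, schedule, and start date below are illustrative assumptions, not taken from the question.

from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.dataflow_operator import DataFlowPythonOperator

default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 7, 1),  # assumed start date
    'retries': 0,
}

# The DAG id and schedule are placeholders for illustration only.
with DAG('dataflow_example_dag',
         default_args=default_args,
         schedule_interval=timedelta(days=1)) as dag:

    test = DataFlowPythonOperator(
        task_id='my_task',
        py_file='path/my_pyfile.py',  # the Beam pipeline script Dataflow will run
        gcp_conn_id='my_conn_id',
        dataflow_default_options={
            'project': 'my_project',
            'staging_location': 'gs://my/staging',
            'temp_location': 'gs://my/temping',
            'requirements_file': 'path/requirements.txt',
        },
    )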

I don't get the same error message, but I think this may help. The Python Dataflow runner seems to terminate in an odd way that doesn't affect standalone Dataflow jobs, but which the DataFlowPythonOperator Python class does not handle properly. I'm filing a ticket, but here is a workaround that solved my problem. Important! The patch must be applied to the Dataflow job, not to the Airflow job.

At the top of your Dataflow job, add the following imports:

import logging
import threading
import time
import types
from apache_beam.runners.runner import PipelineState
Next, add the following above your Dataflow code. It is mostly cut and paste from the main ~dataflow.dataflow_runner class, with the edits marked by comments:

def local_poll_for_job_completion(runner, result, duration):
    """Polls for the specified job to finish running (successfully or not).
    Updates the result with the new job information before returning.
    Args:
      runner: DataflowRunner instance to use for polling job state.
      result: DataflowPipelineResult instance used for job information.
      duration (int): The time to wait (in milliseconds) for job to finish.
        If it is set to :data:`None`, it will wait indefinitely until the job
        is finished.
    """
    last_message_time = None
    current_seen_messages = set()

    last_error_rank = float('-inf')
    last_error_msg = None
    last_job_state = None
    # How long to wait after pipeline failure for the error
    # message to show up giving the reason for the failure.
    # It typically takes about 30 seconds.
    final_countdown_timer_secs = 50.0
    sleep_secs = 5.0

    # Try to prioritize the user-level traceback, if any.
    def rank_error(msg):
        if 'work item was attempted' in msg:
            return -1
        elif 'Traceback' in msg:
            return 1
        return 0

    if duration:
        start_secs = time.time()
        duration_secs = duration // 1000

    job_id = result.job_id()
    keep_checking = True  ### Changed here!!!
    while keep_checking:  ### Changed here!!!
        response = runner.dataflow_client.get_job(job_id)
        # If get() is called very soon after Create() the response may not contain
        # an initialized 'currentState' field.
        logging.info("Current state: " + str(response.currentState))
        # Stop looking if the job is not terminating normally
        if str(response.currentState) in (  ### Changed here!!!
                'JOB_STATE_DONE',  ### Changed here!!!
                'JOB_STATE_CANCELLED',  ### Changed here!!!
                # 'JOB_STATE_UPDATED',
                'JOB_STATE_DRAINED',  ### Changed here!!!
                'JOB_STATE_FAILED'):  ### Changed here!!!
            keep_checking = False  ### Changed here!!!
            break
        if response.currentState is not None:
            if response.currentState != last_job_state:
                logging.info('Job %s is in state %s', job_id, response.currentState)
                last_job_state = response.currentState
            if str(response.currentState) != 'JOB_STATE_RUNNING':
                # Stop checking for new messages on timeout, explanatory
                # message received, success, or a terminal job state caused
                # by the user that therefore doesn't require explanation.
                if (final_countdown_timer_secs <= 0.0
                        or last_error_msg is not None
                        or str(response.currentState) == 'JOB_STATE_UPDATED'):  ### Changed here!!!
                    keep_checking = False  ### Changed here!!!
                    break

                # Check that job is in a post-preparation state before starting the
                # final countdown.
                if (str(response.currentState) not in (
                        'JOB_STATE_PENDING', 'JOB_STATE_QUEUED')):
                    # The job has failed; ensure we see any final error messages.
                    sleep_secs = 1.0      # poll faster during the final countdown
                    final_countdown_timer_secs -= sleep_secs

        time.sleep(sleep_secs)

        # Get all messages since beginning of the job run or since last message.
        page_token = None
        while True:
            messages, page_token = runner.dataflow_client.list_messages(
                job_id, page_token=page_token, start_time=last_message_time)
            for m in messages:
                message = '%s: %s: %s' % (m.time, m.messageImportance, m.messageText)

                if not last_message_time or m.time > last_message_time:
                    last_message_time = m.time
                    current_seen_messages = set()

                if message in current_seen_messages:
                    # Skip the message if it has already been seen at the current
                    # time. This could be the case since the list_messages API is
                    # queried starting at last_message_time.
                    continue
                else:
                    current_seen_messages.add(message)
                # Skip empty messages.
                if m.messageImportance is None:
                    continue
                logging.info(message)
                if str(m.messageImportance) == 'JOB_MESSAGE_ERROR':
                    if rank_error(m.messageText) >= last_error_rank:
                        last_error_rank = rank_error(m.messageText)
                        last_error_msg = m.messageText
            if not page_token:
                break

        if duration:
            passed_secs = time.time() - start_secs
            if passed_secs > duration_secs:
                logging.warning('Timing out on waiting for job %s after %d seconds',
                                job_id, passed_secs)
                break

    result._job = response
    runner.last_error_msg = last_error_msg


def local_is_in_terminal_state(self):
    logging.info("Current Dataflow job state: " + str(self.state))
    logging.info("Current has_job: " + str(self.has_job))
    if self.state in ('DONE', 'CANCELLED', 'DRAINED', 'FAILED'):
        return True
    else:
        return False


class DataflowRuntimeException(Exception):
    """Indicates an error has occurred in running this pipeline."""

    def __init__(self, msg, result):
        super(DataflowRuntimeException, self).__init__(msg)
        self.result = result


def local_wait_until_finish(self, duration=None):
    logging.info("!!!!!!!!!!!!!!!!You are in a Monkey Patch!!!!!!!!!!!!!!!!")
    if not local_is_in_terminal_state(self):  ### Changed here!!!
        if not self.has_job:
            raise IOError('Failed to get the Dataflow job id.')

        # DataflowRunner.poll_for_job_completion(self._runner, self, duration)
        thread = threading.Thread(
            target=local_poll_for_job_completion,  ### Changed here!!!
            args=(self._runner, self, duration))

        # Mark the thread as a daemon thread so a keyboard interrupt on the main
        # thread will terminate everything. This is also the reason we will not
        # use thread.join() to wait for the polling thread.
        thread.daemon = True
        thread.start()
        while thread.isAlive():
            time.sleep(5.0)

        terminated = local_is_in_terminal_state(self)  ### Changed here!!!
        logging.info("Terminated state: " + str(terminated))
        # logging.info("duration: " + str(duration))
        # assert duration or terminated, (  ### Changed here!!!
        #     'Job did not reach to a terminal state after waiting indefinitely.')  ### Changed here!!!

        assert terminated, "Timed out after duration: " + str(duration)  ### Changed here!!!

    else:  ### Changed here!!!
        assert False, "local_wait_till_finish failed at the start"  ### Changed here!!!

    if self.state != PipelineState.DONE:
        # TODO(BEAM-1290): Consider converting this to an error log based on
        # the resolution of the issue.
        raise DataflowRuntimeException(
            'Dataflow pipeline failed. State: %s, Error:\n%s' %
            (self.state, getattr(self._runner, 'last_error_msg', None)), self)

    return self.state
Finally, once the pipeline has been built, run it with the following:

result = p.run()
# Monkey patch to better handle termination
result.wait_until_finish = types.MethodType(local_wait_until_finish, result)
result.wait_until_finish()
Note: if you are running Airflow server v1.9 with the 1.10 patch files, as I was, this fix still won't solve the problem. The patched _Dataflow.wait_for_done function does not return the job_id, and it needs to. A patch on top of the patches is worse than the above. Upgrade if you can. If you can't, pasting the most recent versions of the following files as a header into the DAG script should make it work: airflow/contrib/hooks/gcp_api_base_hook.py, airflow/contrib/hooks/gcp_dataflow_hook.py, and airflow/contrib/operators/dataflow_operator.py.
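
As a rough, hypothetical illustration of the change that note describes (this is not the actual Airflow source; the function shape and the log-parsing regex are assumptions), the idea is that the hook's wait loop has to capture the Dataflow job id it sees in the launcher's output and return it:

import re

# Hypothetical, simplified stand-in for _Dataflow.wait_for_done in
# airflow/contrib/hooks/gcp_dataflow_hook.py -- not the real implementation.
def wait_for_done(proc):
    """Stream the Dataflow launcher's output, remember the job id,
    and return it once the subprocess exits successfully."""
    job_id = None
    # The launcher logs a console URL containing the job id; this regex is
    # an assumption about the shape of that log line.
    job_id_pattern = re.compile(r'/dataflow[^ ]*/jobs/([a-zA-Z0-9_-]+)')
    for raw_line in iter(proc.stderr.readline, b''):
        line = raw_line.decode('utf-8', errors='replace')
        match = job_id_pattern.search(line)
        if match:
            job_id = match.group(1)
    proc.wait()
    if proc.returncode != 0:
        raise Exception('DataFlow failed with return code {}'.format(proc.returncode))
    return job_id  # the value the 1.10 patch files fail to return on 1.9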

import threading
import time
import types   
from apache_beam.runners.runner import PipelineState
I changed the with clause to p = beam.Pipeline(options=pipeline_options), ended it with result.wait_until_finish(), and my DAG succeeded.
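
Putting that change together, a minimal sketch of what the end of the pipeline script might look like without the with block (pipeline_options is just a placeholder here, and local_wait_until_finish is the function defined in the answer above):

import types

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# pipeline_options would normally carry the project, staging and temp locations, etc.
pipeline_options = PipelineOptions()

# Build the pipeline explicitly instead of using `with beam.Pipeline(...) as p:`
p = beam.Pipeline(options=pipeline_options)
# ... apply the pipeline's transforms to p here ...

result = p.run()
# Monkey patch from the answer above so termination is handled correctly.
result.wait_until_finish = types.MethodType(local_wait_until_finish, result)
result.wait_until_finish()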

"DataFlow failed with return code 1" to me means that Dataflow itself raised an error, rather than gcp_dataflow_hook.py or DataflowPythonOperator. So you may need to look in the Cloud Console for errors related to the Dataflow job you are calling.

I am facing exactly the same issue, @Lisa.Z how did you solve it? What is your apache beam version?