Airflow operator logs don't contain the full output

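I have a problem where the BashOperator does not log all of the output from wget. It logs only the first 1-5 lines of the output.

I tried this with just wget as the bash command: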

tester = BashOperator(
    task_id = 'testing',
    bash_command = "wget -N -r -nd --directory-prefix='/tmp/' http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip",
    dag = dag)
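I also tried this as part of a longer bash script that has other commands following wget. Airflow does wait for the script to finish before triggering downstream tasks. Here is an example bash script: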

#!/bin/bash
echo "Starting up..."
wget -N -r -nd --directory-prefix='/tmp/' http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip
echo "Download complete..."
unzip /tmp/httpcomponents-client-4.5.3-src.zip -o -d /tmp/test_airflow
echo "Archive unzipped..."
The last few lines of the log file:

[2017-04-13 18:33:34,214] {base_task_runner.py:95} INFO - Subtask: --------------------------------------------------------------------------------
[2017-04-13 18:33:34,214] {base_task_runner.py:95} INFO - Subtask: Starting attempt 1 of 1
[2017-04-13 18:33:34,215] {base_task_runner.py:95} INFO - Subtask: --------------------------------------------------------------------------------
[2017-04-13 18:33:34,215] {base_task_runner.py:95} INFO - Subtask: 
[2017-04-13 18:33:35,068] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:35,068] {models.py:1342} INFO - Executing <Task(BashOperator): testing> on 2017-04-13 18:33:08
[2017-04-13 18:33:37,569] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:37,569] {bash_operator.py:71} INFO - tmp dir root location: 
[2017-04-13 18:33:37,569] {base_task_runner.py:95} INFO - Subtask: /tmp
[2017-04-13 18:33:37,571] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:33:37,571] {bash_operator.py:81} INFO - Temporary script location :/tmp/airflowtmpqZhPjB//tmp/airflowtmpqZhPjB/testingCkJgDE
[2017-04-13 18:14:54,943] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,942] {bash_operator.py:82} INFO - Running command: /var/www/upstream/xtractor/scripts/Temp_test.sh 
[2017-04-13 18:14:54,951] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,950] {bash_operator.py:91} INFO - Output:
[2017-04-13 18:14:54,955] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,954] {bash_operator.py:96} INFO - Starting up...
[2017-04-13 18:14:54,958] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:54,957] {bash_operator.py:96} INFO - --2017-04-13 18:14:54--  http://apache.cs.utah.edu/httpcomponents/httpclient/source/httpcomponents-client-4.5.3-src.zip
[2017-04-13 18:14:55,106] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,105] {bash_operator.py:96} INFO - Resolving apache.cs.utah.edu (apache.cs.utah.edu)... 155.98.64.87
[2017-04-13 18:14:55,186] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,186] {bash_operator.py:96} INFO - Connecting to apache.cs.utah.edu (apache.cs.utah.edu)|155.98.64.87|:80... connected.
[2017-04-13 18:14:55,284] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,284] {bash_operator.py:96} INFO - HTTP request sent, awaiting response... 200 OK
[2017-04-13 18:14:55,285] {base_task_runner.py:95} INFO - Subtask: [2017-04-13 18:14:55,284] {bash_operator.py:96} INFO - Length: 1662639 (1.6M) [application/zip]
[2017-04-13 18:15:01,485] {jobs.py:2083} INFO - Task exited with return code 0

Edit: further testing shows that the problem lies in logging wget's output.

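This happens because the default operator only prints the last line. Wherever Airflow is installed, replace the contents of airflow/operators/bash_operator.py with the code below. Usually you need to find where Python is installed and then go to site-packages.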

from builtins import bytes
import os
import signal
import logging
from subprocess import Popen, STDOUT, PIPE
from tempfile import gettempdir, NamedTemporaryFile

from airflow.exceptions import AirflowException
from airflow.models import BaseOperator
from airflow.utils.decorators import apply_defaults
from airflow.utils.file import TemporaryDirectory


class BashOperator(BaseOperator):
    """
    Execute a Bash script, command or set of commands.

    :param bash_command: The command, set of commands or reference to a
        bash script (must be '.sh') to be executed.
    :type bash_command: string
    :param xcom_push: If xcom_push is True, every line written to stdout
        (joined with newlines) will also be pushed to an XCom when the
        bash command completes.
    :type xcom_push: bool
    :param env: If env is not None, it must be a mapping that defines the
        environment variables for the new process; these are used instead
        of inheriting the current process environment, which is the default
        behavior. (templated)
    :type env: dict
    :param output_encoding: Output encoding of the bash command.
    :type output_encoding: string
    """
    template_fields = ('bash_command', 'env')
    template_ext = ('.sh', '.bash',)
    ui_color = '#f0ede4'

    @apply_defaults
    def __init__(
            self,
            bash_command,
            xcom_push=False,
            env=None,
            output_encoding='utf-8',
            *args, **kwargs):

        super(BashOperator, self).__init__(*args, **kwargs)
        self.bash_command = bash_command
        self.env = env
        self.xcom_push_flag = xcom_push
        self.output_encoding = output_encoding

    def execute(self, context):
        """
        Execute the bash command in a temporary directory
        which will be cleaned afterwards
        """
        bash_command = self.bash_command
        logging.info("tmp dir root location: \n" + gettempdir())
        line_buffer = []  # collects every line of output, not just the last
        with TemporaryDirectory(prefix='airflowtmp') as tmp_dir:
            with NamedTemporaryFile(dir=tmp_dir, prefix=self.task_id) as f:

                f.write(bytes(bash_command, 'utf_8'))
                f.flush()
                fname = f.name
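                # f.name is already an absolute path; prefixing tmp_dir
                # would print the directory twice in the log
                script_location = os.path.abspath(fname)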
                logging.info("Temporary script "
                             "location :{0}".format(script_location))
                logging.info("Running command: " + bash_command)
                sp = Popen(
                    ['bash', fname],
                    stdout=PIPE, stderr=STDOUT,
                    cwd=tmp_dir, env=self.env,
                    preexec_fn=os.setsid)

                self.sp = sp

                logging.info("Output:")
                line = ''

                for line in iter(sp.stdout.readline, b''):
                    line = line.decode(self.output_encoding).strip()
                    line_buffer.append(line)
                    logging.info(line)
                sp.wait()
                logging.info("Command exited with "
                             "return code {0}".format(sp.returncode))

                if sp.returncode:
                    raise AirflowException("Bash command failed")
        logging.info("\n".join(line_buffer))
        if self.xcom_push_flag:
            return "\n".join(line_buffer)

    def on_kill(self):
        logging.info('Sending SIGTERM signal to bash process group')
        os.killpg(os.getpgid(self.sp.pid), signal.SIGTERM)
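
The core of the operator above is the loop that reads the child's combined stdout/stderr one line at a time, logs each line, and keeps a copy. A minimal standalone sketch of that pattern, outside Airflow, might look like the following. It assumes Python 3; `run_and_log` is a hypothetical helper name, and `errors="replace"` is an added assumption (not in the code above) so that undecodable bytes do not abort the loop.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO)


def run_and_log(argv, encoding="utf-8"):
    """Run argv, log every output line, and return (returncode, lines).

    Hypothetical helper for illustration; not part of Airflow.
    """
    lines = []
    proc = subprocess.Popen(
        argv, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # readline() returns b"" at end of stream, which stops the iteration.
    for raw in iter(proc.stdout.readline, b""):
        # Assumption: replace undecodable bytes instead of raising.
        line = raw.decode(encoding, errors="replace").rstrip()
        lines.append(line)
        logging.info(line)
    proc.stdout.close()
    returncode = proc.wait()
    return returncode, lines


if __name__ == "__main__":
    # Use the current interpreter as a stand-in for a bash script.
    code, out = run_and_log(
        [sys.executable, "-c", "print('first'); print('second')"])
    logging.info("exit code %s, %d lines captured", code, len(out))
```

Merging stderr into stdout matters for wget in particular, since it writes its progress messages to stderr.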

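This is not a complete answer, but it is a big step forward. The problem seems to lie between Python's logging function and the output that wget produces. It turns out the Airflow scheduler was raising an error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2018' in position ….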

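I modified bash_operator.py in the Airflow codebase so that the bash output is encoded before it is logged (on line 95), changing the call to logging.info(line.encode('utf-8')).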

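The error still occurs, but at least it now shows up in the log file together with the output from the rest of the bash script. The error that now appears in the log file is:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position …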


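Even though a Python error still occurs, the output is being logged, so I am satisfied for now. If anyone has ideas on how to solve this in a better way, I am open to them.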
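
Both errors quoted above are consistent with a left single quotation mark (U+2018) meeting an ASCII codec: its UTF-8 encoding begins with the byte 0xe2. That wget is the source of the character is taken from the error message, not verified here. A small sketch, assuming Python 3, reproduces the two error types and shows two ways to avoid them:

```python
left_quote = "\u2018"  # the character named in the UnicodeEncodeError

# Encoding the character as ASCII fails, as in the scheduler error.
try:
    left_quote.encode("ascii")
except UnicodeEncodeError as exc:
    encode_error = exc

# The UTF-8 form starts with 0xe2, the byte named in the second error.
utf8_bytes = left_quote.encode("utf-8")

# Decoding those bytes as ASCII fails on the first byte.
try:
    utf8_bytes.decode("ascii")
except UnicodeDecodeError as exc:
    decode_error = exc

# Decoding with the matching codec round-trips cleanly.
round_trip = utf8_bytes.decode("utf-8")

# If the sink really must be ASCII, substitute instead of raising.
ascii_safe = left_quote.encode("ascii", errors="replace")
```

The practical point is to decode the subprocess output with the codec it was written in and to keep it as text until it reaches a handler that can write that codec.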

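I made these changes, but they don't seem to make any difference. Is the "logging.info()" call what writes to the log file?

You are right. You can check the XCom object; it will contain all the lines. To reflect it in the log, you can change the return statement to logging.info("\n".join(line_buffer)), placed before the self.xcom_push_flag check.

@Christian I was being lazy. Yes, it should.

Is there a PR or Jira issue to add this to Airflow? I find not being able to see the logs of failed bash jobs a big usability problem.

logging.info(line.encode('utf-8'))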