Apache Airflow Celery Redis decode error


I'm using the latest version of Apache Airflow. I started with the LocalExecutor, and in that mode everything worked, except for a few interactions where the web UI stated that the CeleryExecutor is required. I then installed and configured the CeleryExecutor with Redis, using Redis as both the broker URL and the result backend.
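For reference, a CeleryExecutor-with-Redis setup like the one described usually amounts to something along these lines in airflow.cfg (a sketch for an Airflow 1.8-era config; the host name is a placeholder):

```ini
[core]
executor = CeleryExecutor

[celery]
broker_url = redis://<redis-host>:6379/0
celery_result_backend = redis://<redis-host>:6379/0
```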

It seemed to work at first, until a task was scheduled and the following error appeared:

 File "/bin/airflow", line 28, in <module>
    args.func(args)
  File "/usr/lib/python2.7/site-packages/airflow/bin/cli.py", line 882, in scheduler
    job.run()
  File "/usr/lib/python2.7/site-packages/airflow/jobs.py", line 201, in run
    self._execute()
  File "/usr/lib/python2.7/site-packages/airflow/jobs.py", line 1311, in _execute
    self._execute_helper(processor_manager)
  File "/usr/lib/python2.7/site-packages/airflow/jobs.py", line 1444, in _execute_helper
    self.executor.heartbeat()
  File "/usr/lib/python2.7/site-packages/airflow/executors/base_executor.py", line 132, in heartbeat
    self.sync()
  File "/usr/lib/python2.7/site-packages/airflow/executors/celery_executor.py", line 91, in sync
    state = async.state
  File "/usr/lib/python2.7/site-packages/celery/result.py", line 436, in state
    return self._get_task_meta()['status']
  File "/usr/lib/python2.7/site-packages/celery/result.py", line 375, in _get_task_meta
    return self._maybe_set_cache(self.backend.get_task_meta(self.id))
  File "/usr/lib/python2.7/site-packages/celery/backends/base.py", line 352, in get_task_meta
    meta = self._get_task_meta_for(task_id)
  File "/usr/lib/python2.7/site-packages/celery/backends/base.py", line 668, in _get_task_meta_for
    return self.decode_result(meta)
  File "/usr/lib/python2.7/site-packages/celery/backends/base.py", line 271, in decode_result
    return self.meta_from_decoded(self.decode(payload))
  File "/usr/lib/python2.7/site-packages/celery/backends/base.py", line 278, in decode
    accept=self.accept)
  File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 263, in loads
    return decode(data)
  File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
    self.gen.throw(type, value, traceback)
  File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 54, in _reraise_errors
    reraise(wrapper, wrapper(exc), sys.exc_info()[2])
  File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 50, in _reraise_errors
    yield
  File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 263, in loads
    return decode(data)
  File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 59, in pickle_loads
    return load(BytesIO(s))
kombu.exceptions.DecodeError: invalid load key, '{'.
It looks like a pickle serialization error, but I'm not sure how to track down the cause. Any suggestions?

This problem consistently affects my workflows that use the subdag feature, so the issue may be related to that.


Note: I also tested with RabbitMQ, but ran into a different problem: the client reported "connection reset by peer" and crashed, and the RabbitMQ logs showed "client unexpectedly closed TCP connection".

I stumbled onto this after seeing the exact same traceback in the scheduler logs:

  File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 59, in pickle_loads
    return load(BytesIO(s))
kombu.exceptions.DecodeError: invalid load key, '{'.
The fact that Celery was trying to unpickle something beginning with '{' seemed suspicious, so I tcpdump'ed the traffic while triggering a task through the web UI. The resulting capture contained this exchange at almost exactly the time the traceback above appeared in the scheduler log:

    05:03:49.145849 IP <scheduler-ip-addr>.ec2.internal.45597 > <redis-ip-addr>.ec2.internal.6379: Flags [P.], seq 658:731, ack 46, win 211, options [nop,nop,TS val 654768546 ecr 4219564282], length 73: RESP "GET" "celery-task-meta-b0d3a29e-ac08-4e77-871e-b4d553502cc2"
    05:03:49.146086 IP <redis-ip-addr>.ec2.internal.6379 > <scheduler-ip-addr>.ec2.internal.45597: Flags [P.], seq 46:177, ack 731, win 210, options [nop,nop,TS val 4219564282 ecr 654768546], length 131: RESP "{"status": "SUCCESS", "traceback": null, "result": null, "task_id": "b0d3a29e-ac08-4e77-871e-b4d553502cc2", "children": []}"

The payload in Redis's response is clearly JSON, so why is Celery trying to unpickle it? We are in the middle of migrating from Airflow 1.7 to 1.8: during the rollout, one fleet of Airflow workers runs v1.7 and another runs v1.8. The fleets are supposed to pull from queues with disjoint workloads, but, due to an issue in our DAGs, a TaskInstance scheduled by Airflow 1.8 ended up being executed by a Celery worker running Airflow 1.7.
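The mismatch can be reproduced with the standard library alone: a JSON payload begins with the byte '{', which is not a valid pickle opcode, so a consumer that expects pickle fails exactly as in the traceback. A minimal sketch:

```python
import json
import pickle

# A result-backend payload serialized as JSON, shaped like the one
# seen in the tcpdump capture above.
payload = json.dumps({
    "status": "SUCCESS",
    "traceback": None,
    "result": None,
    "task_id": "b0d3a29e-ac08-4e77-871e-b4d553502cc2",
    "children": [],
}).encode("utf-8")

# Deserializing it as pickle fails on the leading '{' (byte 0x7b),
# which is not a valid pickle opcode.
try:
    pickle.loads(payload)
except pickle.UnpicklingError as exc:
    print(exc)  # invalid load key, '{'.
```

This is the same `invalid load key, '{'.` message that kombu wraps in its `DecodeError`.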


Airflow changed the serializer for Celery task state from JSON (the default) to pickle. Workers running a version of the code from before this change serialize results as JSON, while a scheduler running a version that includes this change tries to deserialize the result by unpickling it, which causes the error above.
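One way to avoid this ambiguity during a rolling upgrade is to pin the serializer explicitly on every node, so both fleets agree regardless of each version's default. A sketch using Celery 3.x-style setting names (the exact names and how you feed them to Airflow depend on your Celery and Airflow versions):

```python
# celery_settings.py -- hypothetical config module; Celery 3.x-style names.
CELERY_RESULT_SERIALIZER = 'json'   # must match on every scheduler and worker
CELERY_TASK_SERIALIZER = 'json'
CELERY_ACCEPT_CONTENT = ['json']    # reject payloads in any other format
```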

Please verify which celery_result_backend you have configured in airflow.cfg. If it is not a database backend, try switching to one (MySQL, etc.).


We saw this with the amqp backend (supported only in Celery 3.1 and below); the redis and rpc backends also had intermittent problems.

We ran into this because our configuration management had not updated every node correctly: the scheduler and some workers were on apache-airflow 1.8.2, while a large group of workers was still running airflow 1.8.0. Check that all of your nodes are running the same version of Airflow.
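A quick way to catch this kind of drift is to collect the version each node reports (e.g. from `airflow version`) and flag any outliers. A minimal sketch with a hypothetical helper, using made-up host names:

```python
from collections import Counter

def find_version_drift(versions):
    """Given a mapping of host -> reported Airflow version, return the
    (host, version) pairs that differ from the majority version."""
    majority, _ = Counter(versions.values()).most_common(1)[0]
    return {(host, v) for host, v in versions.items() if v != majority}

# Example: scheduler and one worker on 1.8.2, another worker lagging on 1.8.0.
nodes = {"scheduler": "1.8.2", "worker-1": "1.8.2", "worker-2": "1.8.0"}
print(find_version_drift(nodes))  # {('worker-2', '1.8.0')}
```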