Apache ZooKeeper: Mesos replicated masters don't continue when the active master fails

Tags: apache-zookeeper, centos7, mesos

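I have the following setup: four CentOS 7.0 VMs named master, box01, box02, and box03.

master: mesos master, mesos slave

box01: mesos master, mesos slave, zkServer

box02: mesos master, mesos slave, zkServer

box03: mesos slave, zkServer

Whenever I run a Mesos framework on the cluster without ZooKeeper, everything works fine. However, once I deploy and start the ZooKeeper cluster, a framework only completes if I launch it from the same machine that hosts the active Mesos master.

The elected master is on box01. If I run a framework from box01, it completes fine. If I run it from the master box, I get the following log on the client side and it never proceeds: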

I1101 13:56:11.997733  5384 sched.cpp:164] Version: 0.24.0
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@716: Client environment:host.name=master.localdomain
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@724: Client environment:os.arch=3.10.0-229.el7.x86_64
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@725: Client environment:os.version=#1 SMP Fri Mar 6 11:36:42 UTC 2015
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@log_env@753: Client environment:user.dir=/home/user/download
2015-11-01 13:56:12,011:5383(0x7f55fee16700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=box01:2181,box02:2181,box03:2181 sessionTimeout=10000 watcher=0x7f560236e6d4 sessionId=0 sessionPasswd=<null> context=0x7f5604003c50 flags=0
2015-11-01 13:56:12,018:5383(0x7f55fd613700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.0.11:2181]
2015-11-01 13:56:12,025:5383(0x7f55fd613700):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.0.11:2181], sessionId=0x150c2c9ffc6002d, negotiated timeout=10000
I1101 13:56:12.027992  5398 group.cpp:331] Group process (group(1)@10.0.0.10:35217) connected to ZooKeeper
I1101 13:56:12.028153  5398 group.cpp:805] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I1101 13:56:12.028198  5398 group.cpp:403] Trying to create path '/mesos' in ZooKeeper
I1101 13:56:12.036267  5398 detector.cpp:156] Detected a new leader: (id='11')
I1101 13:56:12.037309  5398 group.cpp:674] Trying to get '/mesos/json.info_0000000011' in ZooKeeper
I1101 13:56:12.041631  5398 detector.cpp:481] A new leading master (UPID=master@10.0.0.11:5050) is detected
I1101 13:56:12.042068  5398 sched.cpp:262] New master detected at master@10.0.0.11:5050
I1101 13:56:12.043937  5398 sched.cpp:272] No credentials provided. Attempting to register without authentication
The /etc/hosts file:

127.0.0.1   localhost localhost.localdomain localhost4 localhost4.localdomain4
::1         localhost localhost.localdomain localhost6 localhost6.localdomain6
10.0.0.10   master master.localdomain
10.0.0.11   box01 box01.localdomain
10.0.0.12   box02 box02.localdomain
10.0.0.13   box03 box03.localdomain
On every machine, the firewall is set to:

firewall-cmd --list-ports
5051/tcp 3888/tcp 2181/tcp 2888/tcp 5050/tcp
To start the mesos master, I use:

/home/user/download/mesos-0.24.0/build/bin/mesos-master.sh --ip=10.0.0.10 --work_dir=/home/user/download/data-mesos --zk=zk://box01:2181,box02:2181,box03:2181/mesos --quorum=2
To start the mesos slave, I use:

/home/user/download/mesos-0.24.0/build/bin/mesos-slave.sh --master=zk://box01:2181,box02:2181,box03:2181/mesos
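Edit:

It turns out that if I run a standalone mesos master on box02 (10.0.0.12) and try to run the framework from the master box (10.0.0.10), the mesos master receives the framework's request to run a job, but the job is never executed.

Framework log on the master box: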

[root@master ~]# java -Djava.library.path=/usr/local/lib -jar /home/user/download/test-framework/example-framework-1.0-SNAPSHOT-jar-with-dependencies.jar box02:5050
I1103 13:44:21.898962 20958 sched.cpp:164] Version: 0.24.0
I1103 13:44:21.910660 20972 sched.cpp:262] New master detected at master@10.0.0.12:5050
I1103 13:44:21.913422 20972 sched.cpp:272] No credentials provided. Attempting to register without authentication

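So ZooKeeper appears to be unrelated to the problem. For some reason, the master cannot send anything back to the machine running the framework (the Mesos scheduler).

Judging by the master log you provided, my guess is that the master cannot open a connection to the framework. This part of the master log looks suspicious: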

I1103 13:44:21.513394 11288 master.cpp:2094] Received SUBSCRIBE call for framework 'framework-example' at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
I1103 13:44:21.513703 11288 master.cpp:2164] Subscribing framework framework-example with checkpointing disabled and capabilities [  ]
I1103 13:44:21.516088 11288 hierarchical.hpp:391] Added framework 20151103-134410-201326602-5050-11260-0000
I1103 13:44:21.517375 11288 master.cpp:4613] Sending 1 offers to framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
E1103 13:44:21.519042 11291 socket.hpp:174] Shutdown failed on fd=14: Transport endpoint is not connected [107]
I1103 13:44:21.520539 11288 master.cpp:1051] Framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455 disconnected
I1103 13:44:21.520593 11288 master.cpp:2370] Disconnecting framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
I1103 13:44:21.520608 11288 master.cpp:2394] Deactivating framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455
W1103 13:44:21.520922 11288 master.hpp:1409] Master attempted to send message to disconnected framework 20151103-134410-201326602-5050-11260-0000 (framework-example) at scheduler-a42792c3-3b5d-4bd3-a840-0e9ed4eaaab5@10.0.0.10:36455

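Could you check that the LIBPROCESS_IP variable is set correctly on the framework node, and that the master can open a connection to the framework node?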
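The connectivity check suggested above can be sketched in Python. This is a minimal sketch and not part of the original answer: it pulls the scheduler's address out of a master log line like the ones quoted, then tries a plain TCP connection to it. It is meant to be run on the master host while the framework is running. The helper names `scheduler_endpoint` and `can_connect` are hypothetical.

```python
import re
import socket

# Matches the address that follows "scheduler-<uuid>@" in a master log line.
_ENDPOINT = re.compile(r"scheduler-[0-9a-f-]+@(\d+\.\d+\.\d+\.\d+):(\d+)")


def scheduler_endpoint(log_line):
    """Return (ip, port) of the scheduler named in a master log line, or None."""
    match = _ENDPOINT.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def can_connect(ip, port, timeout=3.0):
    """Return True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

For the log above, `scheduler_endpoint` yields `("10.0.0.10", 36455)`. If `can_connect` returns False for that pair when called on the master host, that would be consistent with the guess that the master cannot reach the framework.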

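Having the master logs (from both the failed master and the one that took over) and the framework logs would help resolve this.

The framework log is the first code block in my original post. I will provide the logs of both masters later today.

I have edited the original question.

I still don't see the active master's log. Could you attach it?

box02 is the active master. The framework is executed from another box named "master". Sorry for the confusing setup.

Please elaborate on what the LIBPROCESS_IP variable is for. As far as I understand, it is an OS environment variable. Why do I need it? Why is it set through the OS rather than a conf file? I did not find anything useful in the official documentation.

My question is: what makes you think the LIBPROCESS_IP variable is the problem? I would like to understand the reason before acting. Thanks.

You can take a look here:

Those links did not help much. I still cannot explain, in theory, what is supposed to happen and what LIBPROCESS_IP is for. As far as I understand, it is the IP the framework will bind to, but setting it as an environment variable seems unreasonable. Moreover, running the framework with the following command does not work:
LIBPROCESS_IP=10.0.0.10 java -Djava.library.path=/usr/local/lib -jar /home/user/download/test-framework/example-framework-1.0-SNAPSHOT-jar-with-dependencies.jar zk://box01:2181,box02:2181,box03:2181/mesos

It worked for me on RHEL 7.2 when I used both LIBPROCESS_IP and LIBPROCESS_PORT. LIBPROCESS_IP alone did not work for me.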
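One possible reading of that last comment, offered as an assumption rather than a confirmed diagnosis: without LIBPROCESS_PORT the scheduler listens on an OS-assigned port (36455 in the master log), and that port is not in the `firewall-cmd --list-ports` output from the question, so the master's connection back to the scheduler may be blocked. The sketch below compares a port against that output. `open_ports` and `is_open` are hypothetical helper names, and 9090 is an arbitrary example value for LIBPROCESS_PORT.

```python
def open_ports(list_ports_output):
    """Parse `firewall-cmd --list-ports` output into a set of (port, protocol)."""
    ports = set()
    for entry in list_ports_output.split():
        number, _, protocol = entry.partition("/")
        # Entries may be single ports ("5050/tcp") or ranges ("8000-8010/tcp").
        low, _, high = number.partition("-")
        for port in range(int(low), int(high or low) + 1):
            ports.add((port, protocol))
    return ports


def is_open(port, list_ports_output, protocol="tcp"):
    """Return True if port/protocol appears in the firewall's open-port list."""
    return (port, protocol) in open_ports(list_ports_output)


# Output quoted in the question.
FIREWALL = "5051/tcp 3888/tcp 2181/tcp 2888/tcp 5050/tcp"

print(is_open(5050, FIREWALL))   # master port: True
print(is_open(36455, FIREWALL))  # scheduler port from the master log: False
print(is_open(9090, FIREWALL))   # hypothetical pinned LIBPROCESS_PORT: False
```

If that reading is right, pinning the scheduler's port with LIBPROCESS_PORT and opening the same port on the framework host (for instance with `firewall-cmd --add-port=9090/tcp`) would be the matching fix. The thread itself does not confirm this.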