Apache Flink: Flink HA cluster JobManager problem


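I have a Flink 1.2 cluster consisting of 3 JobManagers and 2 TaskManagers. I start the ZooKeeper quorum from JobManager1, get confirmation on the other two JobManagers that ZooKeeper has started, and then start a Flink job on JobManager1.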

flink-conf.yaml is identical on all 5 VMs, which means jobmanager.rpc.address: points to JobManager1 everywhere.

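If I shut down the VM running JobManager1, I expect ZooKeeper to declare one of the remaining JobManagers the leader, and the TaskManagers to reconnect to it. Instead, I see many messages like these in the TaskManager logs: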

2017-03-14 14:13:21,827 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@1.2.3.4:43660/user/jobmanager (attempt 11, timeout: 30 seconds)
2017-03-14 14:13:21,836 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:43660] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:43660]] Caused by: [Connection refused: /1.2.3.4:43660]
For confidentiality I replaced the original IP with 1.2.3.4; it is always the same IP (JobManager1).

More logs:

2017-03-15 10:28:28,655 INFO  org.apache.flink.core.fs.FileSystem                           - Ensuring all FileSystem streams are closed for Async calls on Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:28:38,534 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 
2017-03-15 10:28:46,606 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:28:52,431 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:02,435 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,489 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
2017-03-15 10:29:10,490 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Cancelling all computations and discarding all cached data.
2017-03-15 10:29:10,491 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,491 INFO  org.apache.flink.runtime.taskmanager.Task                     - Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
    at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
    at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
    at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,512 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,515 INFO  org.apache.flink.runtime.taskmanager.Task                     - Attempting to fail task externally Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,515 INFO  org.apache.flink.runtime.taskmanager.Task                     - Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04) switched from RUNNING to FAILED.
java.lang.Exception: TaskManager akka://flink/user/taskmanager disconnects from JobManager akka.tcp://flink@1.2.3.4:44779/user/jobmanager: Old JobManager lost its leadership.
    at org.apache.flink.runtime.taskmanager.TaskManager.handleJobManagerDisconnect(TaskManager.scala:1074)
    at org.apache.flink.runtime.taskmanager.TaskManager.org$apache$flink$runtime$taskmanager$TaskManager$$handleJobManagerLeaderAddress(TaskManager.scala:1426)
    at org.apache.flink.runtime.taskmanager.TaskManager$$anonfun$handleMessage$1.applyOrElse(TaskManager.scala:286)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
    at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
    at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
    at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
    at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
    at org.apache.flink.runtime.taskmanager.TaskManager.aroundReceive(TaskManager.scala:122)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
    at akka.actor.ActorCell.invoke(ActorCell.scala:487)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-15 10:29:10,516 INFO  org.apache.flink.runtime.taskmanager.Task                     - Triggering cancellation of task code Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,516 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Disassociating from JobManager
2017-03-15 10:29:10,525 INFO  org.apache.flink.runtime.blob.BlobCache                       - Shutting down BlobCache
2017-03-15 10:29:10,542 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:10,546 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for Source: Custom Source -> Flat Map (1/1) (75fd495cc6acfd72fbe957e60e513223).
2017-03-15 10:29:10,548 INFO  org.apache.flink.runtime.taskmanager.Task                     - Freeing task resources for Flat Map (1/1) (dd555e0437867c3180a1ecaf0a9f4d04).
2017-03-15 10:29:10,551 INFO  org.apache.flink.core.fs.FileSystem                           - Ensuring all FileSystem streams are closed for Flat Map (1/1)
2017-03-15 10:29:10,552 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Trying to register at JobManager akka.tcp://flink@1.2.3.5:43893/user/jobmanager (attempt 1, timeout: 500 milliseconds)
2017-03-15 10:29:10,567 INFO  org.apache.flink.core.fs.FileSystem                           - Ensuring all FileSystem streams are closed for Source: Custom Source -> Flat Map (1/1)
2017-03-15 10:29:10,632 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Successful registration at JobManager (akka.tcp://flink@1.2.3.5:43893/user/jobmanager), starting network stack and library cache.
2017-03-15 10:29:10,633 INFO  org.apache.flink.runtime.taskmanager.TaskManager              - Determined BLOB server address to be /1.2.3.5:42830. Starting BLOB cache.
2017-03-15 10:29:10,633 INFO  org.apache.flink.runtime.blob.BlobCache                       - Created BLOB cache storage directory /tmp/blobStore-d97e08db-d2f1-4f00-a7d1-30c2f5823934
2017-03-15 10:29:15,551 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:20,571 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:25,582 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
2017-03-15 10:29:30,592 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@1.2.3.4:44779] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@1.2.3.4:44779]] Caused by: [Connection refused: /1.2.3.4:44779]
Does anyone know why the TaskManagers do not try to reconnect to one of the remaining JobManagers (such as 1.2.3.5 above)?


Thanks

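For everyone facing the same problem: HA requires you to provide a DFS location that is accessible from all nodes. I had the state backend checkpoint directory and the ZooKeeper storage directory pointing to a local file system location on each VM. When a JobManager went down, the new leader could not recover the running job because the information was missing or the location was not accessible.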
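As a rough illustration of the point above, a flink-conf.yaml fragment for Flink 1.2 might look like the following. This is only a sketch: the ZooKeeper host names and the S3 bucket name are hypothetical placeholders, and it assumes S3 access is already configured separately on every node. The point it shows is that both storage paths refer to a shared location rather than a per-VM local directory.

```yaml
# conf/flink-conf.yaml (kept identical on every JobManager and TaskManager VM)
# Hypothetical values: the ZooKeeper hosts and the bucket name below are
# placeholders, not taken from the original setup.

high-availability: zookeeper
high-availability.zookeeper.quorum: zk-host-1:2181,zk-host-2:2181,zk-host-3:2181

# JobManager recovery metadata. This must be reachable from all nodes,
# so a local path such as file:///tmp/... would not work here.
high-availability.zookeeper.storageDir: s3://my-flink-bucket/ha/

# Checkpoint data, likewise on shared storage.
state.backend: filesystem
state.backend.fs.checkpointdir: s3://my-flink-bucket/checkpoints/
```

With paths like these, a newly elected JobManager on a different VM can read the same recovery metadata and checkpoints that the failed leader wrote.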

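Edit: since someone asked, the file I modified (in the case of Apache Flink 1.2) is conf/flink-conf.yaml.

I set state.backend.fs.checkpointdir and high-availability.zookeeper.storageDir to AWS S3 paths that are accessible from both the TaskManagers and the JobManagers.

Did you follow the instructions in this link: starting the cluster in HA mode?

Hi Dawid, yes, the problem seems to be . In fact, I even created a PR to fix it ;) I will merge it as soon as possible. Until it is merged, just remove the first inner if else branch. You can have a look at the PR here:

Anyway, could you close one of these two questions? Let's not pollute like this ;)

OK, first of all, the code provided is not a "running" example, but in any case, running it with Kafka and all the environment configuration you need, the job restarted smoothly. Without the complete logs from all TaskManagers and JobManagers, I don't think I will be able to help you.

Which files did you go on to edit to fix this?

conf/flink-conf.yaml (check)

How exactly did you go about modifying that file, in particular? It would be really helpful if you could consider expanding your current answer with this additional information.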