Apache ZooKeeper: Flink can't access the leader JobManager, unable to start jobs

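Our Flink cluster has two JobManagers. Recently, whenever the JobManager leader changes, jobs often go down, and Flink fails to recover the previous jobs after the change. When I restart the Flink cluster, the jobs don't start automatically either, so I have to start them manually. According to the logs, it seems that whenever a new JobManager leader is elected, connections to the new leader are refused, which prevents the required jobs from starting. On our JobManager server, I can't find an active open port 58088. I wonder whether something is wrong in the communication between Flink and ZooKeeper. We are using flink-1.0.3. What could be the cause? Is this a Flink bug? Thanks.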


Logs:

2017-03-09 15:32:41,211 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Starting ZooKeeperLeaderRetrievalService.
2017-03-09 15:32:41,243 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@172.27.163.235:58088/user/jobmanager:36e428ba-0af3-4e39-90d4-106b7779f94a.
2017-03-09 15:32:41,318 WARN  Remoting                                                      - Tried to associate with unreachable remote address [akka.tcp://flink@172.27.163.235:58088]. Address is now gated for 5000 ms, all messages to this address will be delivered to dead letters. Reason: Connection refused: /172.27.163.235:58088
2017-03-09 15:32:41,325 WARN  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - Failed to retrieve leader gateway and port.
akka.actor.ActorNotFound: Actor not found for: ActorSelection[Anchor(akka.tcp://flink@172.27.163.235:58088/), Path(/user/jobmanager)]
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:65)
    at akka.actor.ActorSelection$$anonfun$resolveOne$1.apply(ActorSelection.scala:63)
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.processBatch$1(BatchingExecutor.scala:67)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$mcV$sp(BatchingExecutor.scala:82)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
    at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(BatchingExecutor.scala:59)
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:72)
    at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.scala:58)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.unbatchedExecute(Future.scala:74)
    at akka.dispatch.BatchingExecutor$class.execute(BatchingExecutor.scala:110)
    at akka.dispatch.ExecutionContexts$sameThreadExecutionContext$.execute(Future.scala:73)
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:40)
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:248)
    at akka.pattern.PromiseActorRef.$bang(AskSupport.scala:267)
    at akka.actor.EmptyLocalActorRef.specialHandle(ActorRef.scala:508)
    at akka.actor.DeadLetterActorRef.specialHandle(ActorRef.scala:541)
    at akka.actor.DeadLetterActorRef.$bang(ActorRef.scala:531)
    at akka.remote.RemoteActorRefProvider$RemoteDeadLetterActorRef.$bang(RemoteActorRefProvider.scala:87)
    at akka.remote.EndpointWriter.postStop(Endpoint.scala:561)
    at akka.actor.Actor$class.aroundPostStop(Actor.scala:475)
    at akka.remote.EndpointActor.aroundPostStop(Endpoint.scala:415)
    at akka.actor.dungeon.FaultHandling$class.akka$actor$dungeon$FaultHandling$$finishTerminate(FaultHandling.scala:210)
    at akka.actor.dungeon.FaultHandling$class.terminate(FaultHandling.scala:172)
    at akka.actor.ActorCell.terminate(ActorCell.scala:369)
    at akka.actor.ActorCell.invokeAll$1(ActorCell.scala:462)
    at akka.actor.ActorCell.systemInvoke(ActorCell.scala:478)
    at akka.dispatch.Mailbox.processAllSystemMessages(Mailbox.scala:279)
    at akka.dispatch.Mailbox.run(Mailbox.scala:220)
    at akka.dispatch.Mailbox.exec(Mailbox.scala:231)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2017-03-09 15:32:48,029 INFO  org.apache.flink.runtime.jobmanager.JobManager                - JobManager akka.tcp://flink@172.27.163.227:36876/user/jobmanager was granted leadership with leader session ID Some(ff50dc37-048e-4d95-93f5-df788c06725c).
2017-03-09 15:32:48,037 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Delaying recovery of all jobs by 10000 milliseconds.
2017-03-09 15:32:48,038 INFO  org.apache.flink.runtime.webmonitor.JobManagerRetriever       - New leader reachable under akka.tcp://flink@172.27.163.227:36876/user/jobmanager:ff50dc37-048e-4d95-93f5-df788c06725c.
2017-03-09 15:32:49,038 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app87 (akka.tcp://flink@172.27.165.66:32781/user/taskmanager) as e3a9364cd8deeebf8a757d3979c5ae55. Current number of registered hosts is 1. Current number of alive task slots is 4.
2017-03-09 15:32:49,044 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app83 (akka.tcp://flink@172.27.165.58:40972/user/taskmanager) as 0c980a4c64189d975aa71cb97b1ecb7c. Current number of registered hosts is 2. Current number of alive task slots is 8.
2017-03-09 15:32:49,427 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app27 (akka.tcp://flink@172.27.164.5:50762/user/taskmanager) as 7dcc90275bd63cbcda8361bfe00cb6e8. Current number of registered hosts is 3. Current number of alive task slots is 12.
2017-03-09 15:32:49,676 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app26 (akka.tcp://flink@172.27.163.245:41734/user/taskmanager) as ab62098118261dcaa2d218ea17aa8117. Current number of registered hosts is 4. Current number of alive task slots is 16.
2017-03-09 15:32:49,916 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app84 (akka.tcp://flink@172.27.165.60:53871/user/taskmanager) as 012f186f437e7ba95111ff61d206dae6. Current number of registered hosts is 5. Current number of alive task slots is 20.
2017-03-09 15:32:49,930 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app85 (akka.tcp://flink@172.27.165.62:50068/user/taskmanager) as 68506e37647dfbff11ae193f20a7b624. Current number of registered hosts is 6. Current number of alive task slots is 24.
2017-03-09 15:32:50,658 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app86 (akka.tcp://flink@172.27.165.64:57339/user/taskmanager) as c1e922599fae53e6edc78a2add4edb61. Current number of registered hosts is 7. Current number of alive task slots is 28.
2017-03-09 15:32:50,780 INFO  org.apache.flink.runtime.instance.InstanceManager             - Registered TaskManager at app25 (akka.tcp://flink@172.27.163.241:45878/user/taskmanager) as 3ee2f5d3cb8003df5531d444bd11890c. Current number of registered hosts is 8. Current number of alive task slots is 32.
2017-03-09 15:32:58,054 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Attempting to recover all jobs.
2017-03-09 15:32:58,083 ERROR org.apache.flink.runtime.jobmanager.JobManager                - Fatal error: Failed to recover jobs.
java.io.FileNotFoundException: /apps/flink/recovery/submittedJobGraphb6357063f81b (No such file or directory)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at org.apache.flink.core.fs.local.LocalDataInputStream.<init>(LocalDataInputStream.java:52)
    at org.apache.flink.core.fs.local.LocalFileSystem.open(LocalFileSystem.java:143)
    at org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:51)
    at org.apache.flink.runtime.state.filesystem.FileSerializableStateHandle.getState(FileSerializableStateHandle.java:35)
    at org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore.recoverJobGraphs(ZooKeeperSubmittedJobGraphStore.java:173)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply$mcV$sp(JobManager.scala:433)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:429)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2$$anonfun$apply$mcV$sp$2.apply(JobManager.scala:429)
    at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:429)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:425)
    at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$handleMessage$1$$anonfun$applyOrElse$2.apply(JobManager.scala:425)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
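
One way to check the "Connection refused" symptom independently of Flink is to pull the announced leader addresses out of the JobManager log and probe each one with a plain TCP connect. In the log above, 172.27.163.235:58088 was announced first and refused, while 172.27.163.227:36876 was granted leadership a few seconds later. The following is only a minimal sketch: the helper names (`leader_addresses`, `is_port_open`, `check_leaders`) are made up for illustration, and the regular expression assumes the "New leader reachable under akka.tcp://flink@host:port/user/jobmanager" line format seen in this log.

```python
import re
import socket

# Assumption: leader announcements look like the JobManagerRetriever lines
# in the log above, e.g.
#   New leader reachable under akka.tcp://flink@172.27.163.235:58088/user/jobmanager:<session-id>.
LEADER_RE = re.compile(
    r"New leader reachable under akka\.tcp://flink@([^:/\s]+):(\d+)/user/jobmanager"
)


def leader_addresses(log_text):
    """Return the announced leader (host, port) pairs in order of appearance."""
    return [(host, int(port)) for host, port in LEADER_RE.findall(log_text)]


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_leaders(log_text, timeout=2.0):
    """Map each announced leader address to whether its port accepts connections."""
    return {
        (host, port): is_port_open(host, port, timeout)
        for host, port in leader_addresses(log_text)
    }
```

Calling `check_leaders` on the contents of the JobManager log file from a TaskManager host would show whether an announced leader address is stale (nothing listening on that port) or merely blocked between hosts. It only tests TCP reachability; it says nothing about whether the Akka actor system behind the port is healthy.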