Apache Flink: rolling restart of zetcd causes Flink processes to terminate


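I am running zetcd and Flink in containers on AWS Fargate. The zetcd cluster has three nodes. The deployment strategy replaces one node at a time to maintain quorum. Deploying to the zetcd cluster causes the Flink processes to die because they can no longer connect to ZooKeeper.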

I observe the following:

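  • Starting condition: a healthy zetcd cluster with three nodes and a healthy Flink cluster.
  • When the first zetcd node is deployed, any Flink instances that were talking to that particular zetcd node may lose their connection to ZooKeeper, but they re-establish a connection to one of the other healthy zetcd nodes.
  • When the second zetcd node is deployed, the same thing happens. In addition, I observe that Flink never attempts to connect to the newly provisioned zetcd nodes.
  • When the last zetcd node is deployed, Flink cannot re-establish a connection to zetcd and the Flink processes terminate.
  • After all Flink nodes are restarted, the system returns to a healthy state.

I believe Flink caches the zetcd nodes at startup and is unaware when zetcd nodes are replaced. Once all of the initial zetcd nodes have been replaced, Flink can no longer connect to ZooKeeper and dies.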
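
The hypothesis above can be illustrated with a small simulation. This is only a sketch and not Flink, Curator, or ZooKeeper code: `FakeDns`, `CachingClient`, `ReresolvingClient`, and all IP addresses are hypothetical. It contrasts a client that resolves the service name once at startup with one that resolves it again on every connection attempt, and replays a one-node-at-a-time replacement. Whether the ZooKeeper client bundled with Flink 1.8.1 behaves like the first client is exactly what the question is trying to establish.

```python
# Hypothetical simulation; not Flink, Curator, or ZooKeeper code.

class FakeDns:
    """Stands in for a service-discovery name whose backing IPs change."""

    def __init__(self, addresses):
        self.addresses = list(addresses)

    def resolve(self):
        return list(self.addresses)

    def replace(self, old, new):
        self.addresses[self.addresses.index(old)] = new


class CachingClient:
    """Resolves the name once at startup and never again."""

    def __init__(self, dns):
        self.candidates = dns.resolve()

    def connect(self, live_nodes):
        # First cached address that is still alive, or None if all are gone.
        return next((a for a in self.candidates if a in live_nodes), None)


class ReresolvingClient:
    """Resolves the name again on every connection attempt."""

    def __init__(self, dns):
        self.dns = dns

    def connect(self, live_nodes):
        return next((a for a in self.dns.resolve() if a in live_nodes), None)


def rolling_restart():
    """Replace three nodes one at a time; record what each client reaches."""
    old = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]  # made-up addresses
    new = ["10.0.1.1", "10.0.1.2", "10.0.1.3"]  # made-up addresses
    dns = FakeDns(old)
    live = set(old)
    caching, reresolving = CachingClient(dns), ReresolvingClient(dns)
    history = []
    for old_ip, new_ip in zip(old, new):
        live.remove(old_ip)
        live.add(new_ip)
        dns.replace(old_ip, new_ip)
        history.append((caching.connect(live), reresolving.connect(live)))
    return history
```

In this model the caching client survives the first two replacements by falling back to a remaining original node and has nowhere left to go after the third, which matches the sequence observed in the list above. The re-resolving client keeps finding a live node throughout.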

Flink uses Apache Curator; perhaps this behavior is an artifact of how Curator manages its connection to ZooKeeper.

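I would appreciate any guidance on how to keep Flink in sync with the current list of zetcd nodes, or on whether I have this completely wrong in the first place :)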


Relevant flink-conf.yaml:

high-availability: zookeeper
high-availability.zookeeper.quorum: zetcd-service.local:2181
high-availability.storageDir: s3://flink-state/ha
high-availability.jobmanager.port: 6123
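
Because the quorum setting names a single service hostname rather than individual nodes, it can help to check which addresses that name resolves to at a given moment and compare them with the addresses Flink tries in its logs. The following is a minimal sketch using only the Python standard library. The helper names `parse_quorum` and `current_addresses` are hypothetical, and the parser assumes every entry has an explicit `host:port` form with no chroot suffix.

```python
import socket


def parse_quorum(quorum):
    """Split a 'host1:port1,host2:port2' string into (host, port) pairs.

    Assumes every entry carries an explicit port and no chroot path.
    """
    pairs = []
    for entry in quorum.split(","):
        host, _, port = entry.strip().rpartition(":")
        pairs.append((host, int(port)))
    return pairs


def current_addresses(host, port):
    """Return the sorted, de-duplicated IPs the name resolves to right now."""
    infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); the IP is
    # the first element of sockaddr.
    return sorted({info[4][0] for info in infos})
```

Running `current_addresses(*parse_quorum("zetcd-service.local:2181")[0])` from inside a Flink container before and after each zetcd node replacement would show whether the name itself is being updated. If the name cannot be resolved, `socket.getaddrinfo` raises `socket.gaierror`.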
Flink loses its connection to ZK and attempts to reconnect:

00:42:07.788 [main-SendThread(ip-10-0-59-233.us-west-2.compute.internal:2181)] INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x79526ef2595a9606, likely server has closed socket, closing socket connection and attempting reconnect
00:42:07.888 [main-EventThread] INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/dispatcher no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://10.0.38.41:8081 no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN  org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/resourcemanager no longer participates in the leader election.
00:42:07.889 [Curator-PathChildrenCache-0] DEBUG org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Received CONNECTION_SUSPENDED event
00:42:07.889 [Curator-PathChildrenCache-0] WARN  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
00:42:08.820 [main-SendThread(ip-10-0-160-244.us-west-2.compute.internal:2181)] INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-10-0-160-244.us-west-2.compute.internal/10.0.160.244:2181
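
To see which zetcd addresses Flink actually attempts after the connection is suspended, the `Opening socket connection to server` lines can be pulled out of the log. This is a rough sketch whose regular expression is based only on the single line format shown above; the function name `attempted_servers` is hypothetical, and other ZooKeeper client versions may format the line differently.

```python
import re

# Matches e.g. "... Opening socket connection to server <host>/<ip>:<port>"
ATTEMPT = re.compile(
    r"Opening socket connection to server "
    r"(?P<host>[^/\s]+)/(?P<ip>[\d.]+):(?P<port>\d+)"
)

SAMPLE = (
    "00:42:08.820 [main-SendThread(ip-10-0-160-244.us-west-2.compute.internal:2181)] "
    "INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - "
    "Opening socket connection to server "
    "ip-10-0-160-244.us-west-2.compute.internal/10.0.160.244:2181"
)


def attempted_servers(log_lines):
    """Return the distinct (host, ip, port) targets in first-seen order."""
    seen = []
    for line in log_lines:
        match = ATTEMPT.search(line)
        if match:
            target = (match["host"], match["ip"], int(match["port"]))
            if target not in seen:
                seen.append(target)
    return seen
```

If every address extracted this way belongs to a zetcd node that has already been replaced, that would support the idea that the client is working from a stale address list.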
Flink is unable to connect to the ZK nodes and therefore dies:

00:42:22.892 [Curator-Framework-0] ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Connection timed out for connection string (zetcd-service.local:2181) and timeout (15000) / elapsed (15004)
org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
    at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [flink-dist_2.11-1.8.1.jar:1.8.1]
    at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [flink-dist_2.11-1.8.1.jar:1.8.1]

Hi 多穆罗, did you find a solution to this problem? We have been running into the same issue.