Apache zookeeper Zookeeper在超时后未恢复

Apache zookeeper Zookeeper在超时后未恢复,apache-zookeeper,Apache Zookeeper,Zookeeper在超时后未正确恢复,并在未重新启动的情况下进入非工作状态 这是什么原因造成的,我们如何解决 配置 ZOO_TICK_TIME: 2000 ZOO_INIT_LIMIT: 30000 ZOO_SYNC_LIMIT: 10 ZOO_MAX_CLIENT_CNXNS: 2000 ZOO_STANDALONE_ENABLED: 'false' ZOO_AUTOPURGE_PURGEINTERVAL: 1

Zookeeper在超时后未正确恢复,并在未重新启动的情况下进入非工作状态

这是什么原因造成的,我们如何解决

配置

      ZOO_TICK_TIME: 2000
      ZOO_INIT_LIMIT: 30000
      ZOO_SYNC_LIMIT: 10
      ZOO_MAX_CLIENT_CNXNS: 2000
      ZOO_STANDALONE_ENABLED: 'false'
      ZOO_AUTOPURGE_PURGEINTERVAL: 1
      ZOO_AUTOPURGE_SNAPRETAINCOUNT: 10
      ZOO_LOG4J_PROP: INFO,ROLLINGFILE
日志


对于上面列出的典型症状类别,使用下面的列表检查您是否看到健康客户频繁断开连接

  • 网络连接问题​:

    • 使用ifconfig检查NIC上的许多错误。如果高,这个可以 说明延迟增加的原因
  • I/O饥饿​

    • 当需要高性能时,最好在专用节点上安装Zookeeper
    • 检查您的硬盘性能,例如使用hdparm-tT
    • 确保ZK事务日志目录(默认情况下为/opt/mapr/zkdata)位于fast专用驱动器上
    • 属性是$ZK_DIR/conf/zoo.cfg中的dataLogDir
    • 确保ZooKeeper堆大小不大于可用RAM,以避免交换
  • GC饥饿 ​​- 症状是频繁的客户端断开连接和由于心跳线程不足而导致的会话过期。
    • 使用jstat-gc查看是否有频繁的完全垃圾回收
    • 如果存在问题,请使用其他GC收集器,例如ConcurrentMarkSweep。为此,放置-XX:ParallelGCThreads=8-XX:+UseConcMarkSweepGC
  • 客户端超时
    • 免费检查客户端计算机上的交换-m
  • 如果设置为太低的值,可能会增加maxSessionTimeout
  • 如果未检测到明显的问题,请运行以检查问题是否与zookeeper有关
  • 虚拟环境—如果ZK集群位于某个共享主机上,这可能会导致资源不足并引入延迟

  • 参考链接:

    对于上面列出的典型症状类别,使用下面的列表检查您是否看到健康客户频繁断开连接

  • 网络连接问题​:

    • 使用ifconfig检查NIC上的许多错误。如果高,这个可以 说明延迟增加的原因
  • I/O饥饿​

    • 当需要高性能时,最好在专用节点上安装Zookeeper
    • 检查您的硬盘性能,例如使用hdparm-tT
    • 确保ZK事务日志目录(默认情况下为/opt/mapr/zkdata)位于fast专用驱动器上
    • 属性是$ZK_DIR/conf/zoo.cfg中的dataLogDir
    • 确保ZooKeeper堆大小不大于可用RAM,以避免交换
  • GC饥饿 ​​- 症状是频繁的客户端断开连接和由于心跳线程不足而导致的会话过期。
    • 使用jstat-gc查看是否有频繁的完全垃圾回收
    • 如果存在问题,请使用其他GC收集器,例如ConcurrentMarkSweep。为此,放置-XX:ParallelGCThreads=8-XX:+UseConcMarkSweepGC
  • 客户端超时
    • 免费检查客户端计算机上的交换-m
  • 如果设置为太低的值,可能会增加maxSessionTimeout
  • 如果未检测到明显的问题,请运行以检查问题是否与zookeeper有关
  • 虚拟环境—如果ZK集群位于某个共享主机上,这可能会导致资源不足并引入延迟
  • 参考链接:

    2020-01-1021:01:07900[myid:1]-INFO[QuorumPeer[myid=1](普通=/0.0.0:2181)(安全=禁用):Leader@464]-领导-领导选举-29毫秒

    2020-01-1021:01:07918[myid:1]-信息[QuorumPeer[myid=1](普通=/0.0.0:2181)(安全=禁用):FileTxnSnapLog@384]-快照:0x23007a7d69到/data/version-2/snapshot.23007a7d69

    告诉我,zk节点成为了领导者,并开始编写快照(之后,它可能会将快照发送给追随者)。然后节点将开始为查询提供服务。但是,如果快照很大(或者/data/version-2/底层硬件速度较慢),编写将花费大量时间。如果花费的时间超过
    maxSessionTimeout 40000
    ,则所有活动会话和所有临时数据都将丢失

    如果是这样,那么避免这种情况的唯一好办法就是减少zookeeper中存储的数据量。例如,将重blob存储在其他地方(文件/nfs/hdfs/aws/sql/http/…),并在zk中只写入时间戳(可能还有存储中的路径),这样应用程序仍然可以从zk快速接收数据更改通知。

    2020-01-1021:01:07900[myid:1]-INFO[QuorumPeer[myid=1](普通=/0.0.0:2181)(安全=禁用):Leader@464]-领导-领导选举-29毫秒

    2020-01-1021:01:07918[myid:1]-信息[QuorumPeer[myid=1](普通=/0.0.0:2181)(安全=禁用):FileTxnSnapLog@384]-快照:0x23007a7d69到/data/version-2/snapshot.23007a7d69

    告诉我,zk节点成为了领导者,并开始编写快照(之后,它可能会将快照发送给追随者)。然后节点将开始为查询提供服务。但是,如果快照很大(或者/data/version-2/底层硬件速度较慢),编写将花费大量时间。如果花费的时间超过
    maxSessionTimeout 40000
    ,则所有活动会话和所有临时数据都将丢失


    如果是这样,那么避免这种情况的唯一好办法就是减少zookeeper中存储的数据量。例如,将重blob存储在其他位置(文件/nfs/hdfs/aws/sql/http/…),并在zk中只写入时间戳(可能还有存储中的路径),这样应用程序仍然可以从zk快速接收数据更改通知。

    initLimit似乎太大了。实际上,它相当于60000秒,几乎17小时!?/data/version-2/?@YuryNevinitsin中的快照有多大存储在NVME ssd上大约1 GB initLimit似乎太大了。实际上,它相当于60000秒,几乎17小时!?b怎么样
    2020-01-10 20:58:57,473 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x2200061493 (n.zxid), 0x2 (n.round), LOOKING (n.state), 2 (n.sid), 0x22 (n.peerEPoch), FOLLOWING (my state)0 (n.config version)
    2020-01-10 20:59:57,484 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x2200061493 (n.zxid), 0x2 (n.round), LOOKING (n.state), 2 (n.sid), 0x22 (n.peerEPoch), FOLLOWING (my state)0 (n.config version)
    2020-01-10 21:00:57,495 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 2 (n.leader), 0x2200061493 (n.zxid), 0x2 (n.round), LOOKING (n.state), 2 (n.sid), 0x22 (n.peerEPoch), FOLLOWING (my state)0 (n.config version)
    2020-01-10 21:01:07,510 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Follower@96] - Exception when following the leader
    java.net.SocketTimeoutException: Read timed out
            at java.net.SocketInputStream.socketRead0(Native Method)
            at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
            at java.net.SocketInputStream.read(SocketInputStream.java:171)
            at java.net.SocketInputStream.read(SocketInputStream.java:141)
            at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
            at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
            at java.io.DataInputStream.readInt(DataInputStream.java:387)
            at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
            at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:85)
            at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
            at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:158)
            at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:92)
            at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1271)
    2020-01-10 21:01:07,516 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Follower@201] - shutdown called
    java.lang.Exception: shutdown Follower
            at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:201)
            at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:1275)
    2020-01-10 21:01:07,516 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):LearnerZooKeeperServer@165] - Shutting down
    2020-01-10 21:01:07,517 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):ZooKeeperServer@558] - shutting down
    2020-01-10 21:01:07,517 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FollowerRequestProcessor@139] - Shutting down
    2020-01-10 21:01:07,518 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):CommitProcessor@362] - Shutting down
    2020-01-10 21:01:07,518 [myid:1] - INFO  [FollowerRequestProcessor:1:FollowerRequestProcessor@110] - FollowerRequestProcessor exited loop!
    2020-01-10 21:01:07,518 [myid:1] - INFO  [CommitProcessor:1:CommitProcessor@195] - CommitProcessor exited loop!
    2020-01-10 21:01:07,518 [myid:1] - INFO  [CommitProcessor:1:CommitProcessor@195] - CommitProcessor exited loop!
    2020-01-10 21:01:07,526 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FinalRequestProcessor@514] - shutdown of request processor complete
    2020-01-10 21:01:07,659 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):SyncRequestProcessor@191] - Shutting down
    2020-01-10 21:01:07,660 [myid:1] - INFO  [SyncThread:1:SyncRequestProcessor@169] - SyncRequestProcessor exited!
    2020-01-10 21:01:07,661 [myid:1] - WARN  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1318] - PeerState set to LOOKING
    2020-01-10 21:01:07,661 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1193] - LOOKING
    2020-01-10 21:01:07,664 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FastLeaderElection@885] - New election. My id =  1, proposed zxid=0x23007a7d69
    2020-01-10 21:01:07,666 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 1 (n.leader), 0x23007a7d69 (n.zxid), 0x2 (n.round), LOOKING (n.state), 1 (n.sid), 0x23 (n.peerEPoch), LOOKING (my state)0 (n.config version)
    2020-01-10 21:01:07,668 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 3 (n.leader), 0x2200061493 (n.zxid), 0x1 (n.round), LEADING (n.state), 3 (n.sid), 0x23 (n.peerEPoch), LOOKING (my state)0 (n.config version)
    2020-01-10 21:01:07,668 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 1 (n.leader), 0x23007a7d69 (n.zxid), 0x2 (n.round), LOOKING (n.state), 2 (n.sid), 0x23 (n.peerEPoch), LOOKING (my state)0 (n.config version)
    2020-01-10 21:01:07,870 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):QuorumPeer@1281] - LEADING
    2020-01-10 21:01:07,885 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@66] - TCP NoDelay set to: true
    2020-01-10 21:01:07,885 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@86] - zookeeper.leader.maxConcurrentSnapshots = 10
    2020-01-10 21:01:07,885 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@88] - zookeeper.leader.maxConcurrentSnapshotTimeout = 5
    2020-01-10 21:01:07,890 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):ZooKeeperServer@938] - minSessionTimeout set to 4000
    2020-01-10 21:01:07,890 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):ZooKeeperServer@947] - maxSessionTimeout set to 40000
    2020-01-10 21:01:07,890 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):ZooKeeperServer@166] - Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir /datalog/version-2 snapdir /data/version-2
    2020-01-10 21:01:07,900 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):Leader@464] - LEADING - LEADER ELECTION TOOK - 29 MS
    2020-01-10 21:01:07,918 [myid:1] - INFO  [QuorumPeer[myid=1](plain=/0.0.0.0:2181)(secure=disabled):FileTxnSnapLog@384] - Snapshotting: 0x23007a7d69 to /data/version-2/snapshot.23007a7d69
    2020-01-10 21:01:07,964 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,040 [myid:1] - WARN  [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,318 [myid:1] - WARN  [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,491 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,491 [myid:1] - WARN  [NIOWorkerThread-4:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,501 [myid:1] - WARN  [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,501 [myid:1] - WARN  [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,509 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,510 [myid:1] - WARN  [NIOWorkerThread-4:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,519 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@679] - Notification: 2 (message format version), 3 (n.leader), 0x23007a7d72 (n.zxid), 0x2 (n.round), LOOKING (n.state), 3 (n.sid), 0x23 (n.peerEPoch), LEADING (my state)0 (n.config version)
    2020-01-10 21:01:08,524 [myid:1] - WARN  [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,524 [myid:1] - WARN  [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,535 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,535 [myid:1] - WARN  [NIOWorkerThread-4:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,545 [myid:1] - WARN  [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,545 [myid:1] - WARN  [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,572 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,574 [myid:1] - WARN  [NIOWorkerThread-4:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,582 [myid:1] - WARN  [NIOWorkerThread-2:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,583 [myid:1] - WARN  [NIOWorkerThread-3:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,594 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,594 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    2020-01-10 21:01:08,595 [myid:1] - WARN  [NIOWorkerThread-1:NIOServerCnxn@370] - Exception causing close of session 0x0: ZooKeeperServer not running
    ...