Java: Configuring the Hazelcast CPSubsystem retry timeout


I currently have three instances registered in the CPSubsystem:

      ----- 
     | I1* | * Leader
      ----- 

 ----       ---- 
| I2 |     | I3 |
 ----       ---- 
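Once all instances are up and running, they are all registered and can see each other in the CPSubsystem, and everything works as expected. The following call is used to perform distributed locking across all instances: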

getHazelcastInstance().getCpSubsystem().getLock(lockDefinition.getLockEntryName())
I noticed a problem when two of the instances die and there is no leader, and no other instance is available to run a leader election:

      ----- 
     | XXX | * DEAD
      ----- 

 ----       ----- 
| I2 |     | XXX | * DEAD
 ----       ----- 
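The remaining instance then tries to acquire the distributed lock, but the request freezes while executing the getLock method, so requests queue up for minutes (a timeout needs to be configured for the case where the instance becomes the only one left in the subsystem).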

I also noticed that the following logs are printed forever:

2019-08-16 10:56:21.697  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:23.737  WARN 1337 --- [cached.thread-8] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 106
2019-08-16 10:56:23.927  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:26.006  WARN 1337 --- [onMonitorThread] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1468, invocationTime=1565963786004 (2019-08-16 10:56:26.004), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=130, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5701, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.core.MemberLeftException: Member [127.0.0.1]:5702 - ab45ea09-c8c9-4f03-b3db-42b7b440d161 this has left cluster!
2019-08-16 10:56:26.232  WARN 1337 --- [cached.thread-8] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 107
2019-08-16 10:56:26.413  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:27.143  WARN 1337 --- [onMonitorThread] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1479, invocationTime=1565963787142 (2019-08-16 10:56:27.142), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=140, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5703, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.spi.exception.TargetNotMemberException: Not Member! target: CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, partitionId: 81, operation: com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp, service: hz:core:raft
2019-08-16 10:56:28.835  WARN 1337 --- [cached.thread-6] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 108
2019-08-16 10:56:28.941  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:31.038  WARN 1337 --- [cached.thread-3] c.h.nio.tcp.TcpIpConnectionErrorHandler  : [127.0.0.1]:5702 [dev] [3.12.1] Removing connection to endpoint [127.0.0.1]:5701 Cause => java.net.SocketException {Connection refused to address /127.0.0.1:5701}, Error-Count: 109
2019-08-16 10:56:31.533  WARN 1337 --- [ration.thread-1] Impl$LeaderFailureDetectionTask(default) : [127.0.0.1]:5702 [dev] [3.12.1] We are FOLLOWER and there is no current leader. Will start new election round...
2019-08-16 10:56:31.555  WARN 1337 --- [.async.thread-3] c.h.s.i.o.impl.Invocation                : [127.0.0.1]:5702 [dev] [3.12.1] Retrying invocation: Invocation{op=com.hazelcast.cp.internal.operation.ChangeRaftGroupMembershipOp{serviceName='hz:core:raft', identityHash=1295439737, partitionId=81, replicaIndex=0, callId=1493, invocationTime=1565963791554 (2019-08-16 10:56:31.554), waitTimeout=-1, callTimeout=60000, groupId=CPGroupId{name='default', seed=0, commitIndex=6}, membersCommitIndex=0, member=CPMember{uuid=4792972d-d430-48f5-93ed-cb0e1fd8aed2, address=[127.0.0.1]:5703}, membershipChangeMode=REMOVE}, tryCount=250, tryPauseMillis=500, invokeCount=150, callTimeoutMillis=60000, firstInvocationTimeMs=1565963740657, firstInvocationTime='2019-08-16 10:55:40.657', lastHeartbeatMillis=0, lastHeartbeatTime='1969-12-31 21:00:00.000', target=[127.0.0.1]:5702, pendingResponse={VOID}, backupsAcksExpected=0, backupsAcksReceived=0, connection=null}, Reason: com.hazelcast.cp.exception.NotLeaderException: CPMember{uuid=ab45ea09-c8c9-4f03-b3db-42b7b440d161, address=[127.0.0.1]:5702} is not LEADER of CPGroupId{name='default', seed=0, commitIndex=6}. Known leader is: N/A
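Is there a way to determine that the instance is now running alone and, if so, to avoid blocking the application while it acquires a new lock?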


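I am hoping for some mechanism that does not block the flow of the application in any way. If the application is running alone, I would even use a regular j.u.c.l.ReentrantLock instead of the FencedLock.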
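
The fallback described above could be sketched roughly as follows. This is only an illustration built on java.util.concurrent: FallbackLocks, select, and the two suppliers are hypothetical names, and the availability check is assumed to be supplied by the application. Note that a local ReentrantLock only guards threads inside one JVM, so this trades cross-instance mutual exclusion for availability.

```java
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

// Hypothetical helper: picks a distributed lock when it is usable, a local one otherwise.
final class FallbackLocks {
    private final BooleanSupplier distributedAvailable; // assumed health check
    private final Supplier<Lock> distributedLock;       // e.g. a supplier wrapping the getLock call
    private final Lock localLock = new ReentrantLock(); // JVM-local fallback

    FallbackLocks(BooleanSupplier distributedAvailable, Supplier<Lock> distributedLock) {
        this.distributedAvailable = distributedAvailable;
        this.distributedLock = distributedLock;
    }

    // The distributed supplier is only invoked when the check reports it as available,
    // so a blocking lookup is never attempted while the subsystem is down.
    Lock select() {
        return distributedAvailable.getAsBoolean() ? distributedLock.get() : localLock;
    }

    public static void main(String[] args) {
        Lock pretendDistributed = new ReentrantLock(); // stand-in for a distributed lock
        FallbackLocks locks = new FallbackLocks(() -> false, () -> pretendDistributed);
        Lock lock = locks.select();
        lock.lock();
        try {
            System.out.println("in critical section, distributed=" + (lock == pretendDistributed));
        } finally {
            lock.unlock();
        }
    }
}
```

If the availability check flips while a lock is held, one caller may hold the local lock while another holds the distributed one, which is one reason to prefer rejecting requests over falling back.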

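The CP Subsystem is designed to block all operations on the data structures that belong to the CP Subsystem family when there are not enough members available to form the CP Subsystem in the first place. The required member count is set with
CPSubsystemConfig.setCPMemberCount(int)

hazelcastInstance.getCPSubsystem().getCPSubsystemManagementService().getCPMembers() gives you the CP members in the cluster.


To determine the number of cluster members, you can use
hazelcastInstance.getCluster().getMembers()
and/or register a MembershipListener to handle member join and leave events.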
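
Once the group is formed, the reason a single survivor out of three cannot make progress follows from the majority rule of Raft, the consensus algorithm referenced in the hz:core:raft log entries above. A minimal sketch of that arithmetic, where CpQuorum and its methods are illustrative names and not part of the Hazelcast API:

```java
// Illustrative helper only; not a Hazelcast class.
final class CpQuorum {
    private CpQuorum() {}

    // Smallest number of members that forms a majority of a group of the given size.
    static int majority(int groupSize) {
        if (groupSize < 1) {
            throw new IllegalArgumentException("groupSize must be >= 1");
        }
        return groupSize / 2 + 1;
    }

    // A majority-based group can elect a leader and commit only while a majority is reachable.
    static boolean canMakeProgress(int groupSize, int reachableMembers) {
        return reachableMembers >= majority(groupSize);
    }

    public static void main(String[] args) {
        System.out.println(canMakeProgress(3, 2)); // two of three alive
        System.out.println(canMakeProgress(3, 1)); // only I2 alive, as in the question
    }
}
```

With three CP members the majority is two, so losing one instance is tolerated while losing two leaves the survivor retrying elections indefinitely.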


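After a few days of testing, I came to the following conclusions:

  • Although the CPSubsystem needs at least three members to start working, it is fine to have only two instances running afterwards.
  • In the most catastrophic scenario I presented (only one instance running), there is not much that can be done: the environment is probably in trouble and needs some kind of intervention or attention to resolve the outage.
  • I decided to refuse to serve requests if that happens, in order to keep all operations consistent across the modules.

This decision was made after reading a lot of material and simulating the scenario described above.

So, the approach is as follows: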

    try {
        if( !hz.isCpInstanceAvailable() ) {
            throw new HazelcastUnavailableException("CPSubsystem is not available");
        }
        // ... acquire the lock here ...
    } catch (HazelcastUnavailableException e) {
        LOG.error("Error retrieving Hazelcast Distributed Lock :( Please check the CPSubsystem health among all instances", e);
        throw e;
    }
    
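The method isCpInstanceAvailable performs three validations:

  • whether the current application is registered in the CPSubsystem
  • whether the CPSubsystem is up
  • whether the CPSubsystem has more than one member

So, here is the solution: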

    protected boolean isCpInstanceAvailable() {
        try {
            return getCPLocalMember() != null && getCPMembers().get(getMemberValidationTimeout(), TimeUnit.SECONDS).size() > ONE_MEMBER;
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            LOG.error("Issue retrieving CP Members", e);
        }
    
        return false;
    }
    
    protected ICompletableFuture<Collection<CPMember>> getCPMembers() {
        return Optional.ofNullable(getCPSubsystemManagementService().getCPMembers()).orElseThrow(
                () -> new HazelcastUnavailableException("CP Members not available"));
    }
    
    protected CPMember getCPLocalMember() {
        return getCPSubsystemManagementService().getLocalCPMember();
    }
    
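Here is the catch: simply calling getCPMembers().get() causes the long pause I was experiencing (the default timeout).

So I used getCPMembers().get(getMemberValidationTimeout(), TimeUnit.SECONDS) instead, which throws an exception if the call takes longer than the expected timeout.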
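
The difference between the two calls can be reproduced with plain java.util.concurrent types. The sketch below assumes only that the value returned by getCPMembers() behaves like a standard Future. BoundedWait and hasMoreThanOneMember are hypothetical names, and a CompletableFuture that never completes stands in for a CP call that cannot reach a leader.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper showing a bounded wait on a Future.
final class BoundedWait {
    private BoundedWait() {}

    // True only if the future yields more than one member before the timeout expires.
    static boolean hasMoreThanOneMember(Future<? extends Collection<?>> members,
                                        long timeout, TimeUnit unit) {
        try {
            // get(timeout, unit) gives up with TimeoutException; get() would wait indefinitely.
            return members.get(timeout, unit).size() > 1;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt flag for callers
            return false;
        } catch (ExecutionException | TimeoutException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Future<List<String>> stuck = new CompletableFuture<>(); // never completes
        System.out.println(hasMoreThanOneMember(stuck, 200, TimeUnit.MILLISECONDS));

        Future<List<String>> healthy = CompletableFuture.completedFuture(Arrays.asList("I1", "I2"));
        System.out.println(hasMoreThanOneMember(healthy, 200, TimeUnit.MILLISECONDS));
    }
}
```

Restoring the interrupt flag is a small addition over the code above; it lets a caller that is shutting down notice the interruption instead of having it swallowed.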
