Java Hazelcast server heartbeat delays, clients unable to connect

I ran some tests with Hazelcast 3.5.1. On a 1 GB heap it runs fine, but after increasing the heap size to 4 GB it only works well until memory usage reaches about 80%, at which point the heartbeat response delay grows rapidly (the log fills with messages like "System clock apparently jumped from ... to ... since last heartbeat").

After a while, clients and the Management Center can no longer connect, and the server JVM eventually crashes.

Platform: Ubuntu 14.04 (8 GB RAM), Hazelcast 3.5.1, Java 1.7.0_79
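The test code itself isn't shown; a hypothetical minimal reproduction under the same conditions (a single member started with `-Xms4g -Xmx4g`, filled with IMap entries until roughly 80% of the heap is used — map name and value size are made up) might look like this:

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;

public class HeapPressureRepro {
    public static void main(String[] args) {
        // Run with e.g. -Xms4g -Xmx4g to match the 4 GB heap described above.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<Integer, byte[]> map = hz.getMap("test");
        Runtime rt = Runtime.getRuntime();
        int key = 0;
        // Insert 1 MB values until roughly 80% of the max heap is in use,
        // the point where the heartbeat delays started to appear.
        while (rt.totalMemory() - rt.freeMemory() < rt.maxMemory() * 0.8) {
            map.put(key++, new byte[1024 * 1024]);
        }
        System.out.println("Reached ~80% heap usage after " + key + " entries");
    }
}
```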

Server log:

> Aug 20, 2015 5:22:07 PM com.hazelcast.cluster.ClusterService
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] System clock apparently jumped from 2015-08-20T17:21:54.376 to 2015-08-20T17:22:07.564 since last heartbeat (+12188ms).
> Aug 20, 2015 5:22:07 PM com.hazelcast.internal.monitors.HealthMonitor
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] processors=4, physical.memory.total=7.5G, physical.memory.free=347.9M, swap.space.total=7.7G, swap.space.free=7.6G, heap.memory.used=3.5G, heap.memory.free=301.1M, heap.memory.total=3.8G, heap.memory.max=3.8G, heap.memory.used/total=92.28%, heap.memory.used/max=92.28%, minor.gc.count=33, minor.gc.time=9148ms, major.gc.count=22, major.gc.time=152120ms, load.process=85.00%, load.system=95.00%, load.systemAverage=4.77, thread.count=54, thread.peakCount=56, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=10, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=0, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=10, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=0, operations.running.size=0, operations.pending.invocations.count=0, operations.pending.invocations.percentage=0.00%, proxy.count=1, clientEndpoint.count=1, connection.active.count=1, client.connection.count=1, connection.count=0
> Aug 20, 2015 5:22:22 PM com.hazelcast.cluster.ClusterService
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] System clock apparently jumped from 2015-08-20T17:22:08.564 to 2015-08-20T17:22:22.467 since last heartbeat (+12903ms).
> Aug 20, 2015 6:13:59 PM com.hazelcast.cluster.ClusterService
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] System clock apparently jumped from 2015-08-20T18:10:46.846 to 2015-08-20T18:13:59.477 since last heartbeat (+191631ms).
> Aug 20, 2015 6:13:04 PM com.hazelcast.cluster.ClusterService
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] System clock apparently jumped from 2015-08-20T18:10:46.846 to 2015-08-20T18:12:17.580 since last heartbeat (+89734ms).
> Aug 20, 2015 6:15:55 PM com.hazelcast.cluster.ClusterService
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] System clock apparently jumped from 2015-08-20T18:12:17.580 to 2015-08-20T18:15:55.876 since last heartbeat (+217296ms).
> Aug 20, 2015 6:15:55 PM com.hazelcast.client.ClientEndpointManager
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] Destroying ClientEndpoint{conn=Connection [0.0.0.0/0.0.0.0:5701 -> /127.0.0.1:59157], endpoint=Address[127.0.0.1]:59157, live=false, type=JAVA_CLIENT, principal='ClientPrincipal{uuid='d32f7c02-afb4-4212-8f1b-118c526f3e05', ownerUuid='be3921ba-52fa-4939-a029-6afb5013c25a'}', firstConnection=true, authenticated=true}
> Aug 20, 2015 6:12:58 PM com.hazelcast.nio.tcp.SocketAcceptor
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] Accepting socket connection from /127.0.0.1:59171
> Aug 20, 2015 6:15:55 PM com.hazelcast.cluster.ClusterService
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] System clock apparently jumped from 2015-08-20T18:13:59.477 to 2015-08-20T18:15:55.875 since last heartbeat (+115398ms).
> Aug 20, 2015 6:15:55 PM com.hazelcast.internal.monitors.HealthMonitor
> INFO: [172.17.42.1]:5701 [dev] [3.5.1] processors=4, physical.memory.total=7.5G, physical.memory.free=401.2M, swap.space.total=7.7G, swap.space.free=7.6G, heap.memory.used=3.6G, heap.memory.free=188.9M, heap.memory.total=3.8G, heap.memory.max=3.8G, heap.memory.used/total=95.16%, heap.memory.used/max=95.16%, minor.gc.count=33, minor.gc.time=9148ms, major.gc.count=515, major.gc.time=3369835ms, load.process=94.00%, load.system=97.00%, load.systemAverage=4.12, thread.count=46, thread.peakCount=67, cluster.timeDiff=0, event.q.size=0, executor.q.async.size=0, executor.q.client.size=0, executor.q.query.size=0, executor.q.scheduled.size=3, executor.q.io.size=0, executor.q.system.size=0, executor.q.operation.size=0, executor.q.priorityOperation.size=0, executor.q.response.size=0, operations.remote.size=0, operations.running.size=0, operations.pending.invocations.count=0, operations.pending.invocations.percentage=0.00%, proxy.count=1, clientEndpoint.count=0, connection.active.count=0, client.connection.count=0, connection.count=0
> Aug 20, 2015 6:15:55 PM com.hazelcast.nio.tcp.WriteHandler
> WARNING: [172.17.42.1]:5701 [dev] [3.5.1] hz._hzInstance_1_dev.IO.thread-out-0 Closing socket to endpoint Address[127.0.0.1]:59044, Cause:java.nio.channels.CancelledKeyException
> java.nio.channels.CancelledKeyException
>   at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
>   at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
>   at com.hazelcast.nio.tcp.AbstractSelectionHandler.unregisterOp(AbstractSelectionHandler.java:99)
>   at com.hazelcast.nio.tcp.WriteHandler.unschedule(WriteHandler.java:197)
>   at com.hazelcast.nio.tcp.WriteHandler.handle(WriteHandler.java:252)
>   at com.hazelcast.nio.tcp.WriteHandler.run(WriteHandler.java:331)
>   at com.hazelcast.nio.tcp.AbstractIOSelector.executeTask(AbstractIOSelector.java:104)
>   at com.hazelcast.nio.tcp.AbstractIOSelector.processSelectionQueue(AbstractIOSelector.java:97)
>   at com.hazelcast.nio.tcp.AbstractIOSelector.run(AbstractIOSelector.java:123)
> WARNING: [172.17.42.1]:5701 [dev] [3.5.1] Resetting master confirmation timestamps because of huge system clock jump! Clock-Jump: 661318ms, Master-Confirmation-Timeout: 500000ms.
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x00007f49ba237122, pid=13165, tid=139954432919296
> #
> # JRE version: OpenJDK Runtime Environment (7.0_79-b14) (build 1.7.0_79-b14)
> # Java VM: OpenJDK 64-Bit Server VM (24.79-b02 mixed mode linux-amd64 compressed oops)
> # Derivative: IcedTea 2.5.6
> # Distribution: Ubuntu 14.04 LTS, package 7u79-2.5.6-0ubuntu1.14.04.1
> # Problematic frame:
> # V  [libjvm.so+0x82b122]  ParallelCompactData::calc_new_pointer(HeapWord*)+0x32
> #
> # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
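For what it's worth, a rough reading of the two HealthMonitor samples suggests where the time went: between 5:22:07 PM and 6:15:55 PM (about 3,228 s of wall-clock time), cumulative major.gc.time grew from 152,120 ms to 3,369,835 ms, i.e. roughly 3,218 s spent in 515 − 22 = 493 additional full GCs. The JVM was spending nearly all of its time in full GC, which would explain the apparent clock jumps (the heartbeat thread barely gets scheduled) and, with only 188.9M of heap left free, would fit the eventual SIGSEGV in the ParallelCompactData frame shown in the crash header.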

This looks like an Ubuntu issue.

I had the same problem on my old AWS paravirtual (PV) instances. The fix was to migrate to newer HVM instances.


More information about the known AWS issue:

One of the questions is why the heartbeat messages take so long to get through. Can you reproduce the problem with Hazelcast 3.6-SNAPSHOT? I have added internal metrics that provide very detailed information about the internals, including all the queue sizes of a connection. That way we can see in much more detail what is going on. The health monitor in 3.6 uses the new metrics system, but it only displays a small part of the information. I can help you set it up.
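In the meantime, a minimal sketch of turning up the member's own diagnostics while reproducing (this assumes the standard `hazelcast.health.monitoring.*` system properties of the 3.x line; the new 3.6-SNAPSHOT metrics hooks mentioned above are not shown here):

```java
import com.hazelcast.config.Config;
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;

public class NoisyMember {
    public static void main(String[] args) {
        Config config = new Config();
        // NOISY logs the full HealthMonitor line on every tick instead of
        // only when a usage threshold is exceeded.
        config.setProperty("hazelcast.health.monitoring.level", "NOISY");
        // Sample every 10 seconds (assumed default: 30).
        config.setProperty("hazelcast.health.monitoring.delay.seconds", "10");
        HazelcastInstance hz = Hazelcast.newHazelcastInstance(config);
    }
}
```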