Ignite: Blocked system-critical thread has been detected


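I am using Ignite.NET 2.7.6, with one server and about 40 clients. After 8 hours of work the server starts to misbehave: clients cannot connect to it, some queries return no results, and so on.

On the server side, memory consumption is normal, the thread count is about 250, and everything looks fine. I could not see any obvious problem, so I decided to work through everything the server log marks as SEVERE.

The first one I ran into is: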

Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=tcp-comm-worker, blockedFor=13s]

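So I would like to understand why this happens. The full server log can be found here:

Added: this problem does not seem to be harmless. The message appears once per second from different threads, and the "blockedFor" value grows from seconds to hours.

The load on the server is low, but as its threads become locked it stops responding and stops registering new clients.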
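To see which threads are affected and how far "blockedFor" has grown, the SEVERE lines can be summarized per thread. The following is a minimal sketch, not part of Ignite: the class and method names are hypothetical, and it assumes the message always ends with [threadName=..., blockedFor=<N>s] as in the log line quoted later in this thread. The second sample line is made up for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an Ignite API: summarizes "blocked thread" messages from a server log.
public class BlockedThreadSummary {
    // Assumes the message tail looks like: [threadName=sys-stripe-7, blockedFor=13s]
    private static final Pattern BLOCKED = Pattern.compile(
        "Blocked system-critical thread has been detected.*"
            + "\\[threadName=([^,\\]]+), blockedFor=(\\d+)s\\]");

    /** Returns the longest blockedFor value in seconds seen for each thread, in first-seen order. */
    public static Map<String, Long> maxBlockedSeconds(List<String> logLines) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (String line : logLines) {
            Matcher m = BLOCKED.matcher(line);
            if (m.find()) {
                // Keep the largest value reported so far for this thread name.
                result.merge(m.group(1), Long.parseLong(m.group(2)), Long::max);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. "
                + "This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]",
            // Illustrative line, not taken from the real log.
            "[18:53:05,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. "
                + "This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=14s]");

        maxBlockedSeconds(sample).forEach((thread, seconds) ->
            System.out.println(thread + " blocked for up to " + seconds + "s"));
    }
}
```

A value that keeps growing for the same thread suggests that thread never recovered, while many short entries across different threads point to repeated, shorter stalls.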

Here is the log from the server:

Here is the log from one of the clients:


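The last line of the client log is at 19:03:52, when the server was restarted.

I see the following .NET-specific exception among the others, but it should be triggered by a different problem. Anyway, this is

The first exception is related to communication problems at the network level. See below: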

java.io.IOException: Удаленный хост принудительно разорвал существующее подключение
    at sun.nio.ch.SocketDispatcher.read0(Native Method)
    at sun.nio.ch.SocketDispatcher.read(Unknown Source)
    at sun.nio.ch.IOUtil.readIntoNativeBuffer(Unknown Source)
    at sun.nio.ch.IOUtil.read(Unknown Source)
    at sun.nio.ch.SocketChannelImpl.read(Unknown Source)
    at org.apache.ignite.internal.util.nio.GridNioServer$DirectNioClientWorker.processRead(GridNioServer.java:1282)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.processSelectedKeysOptimized(GridNioServer.java:2386)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.bodyInternal(GridNioServer.java:2153)
    at org.apache.ignite.internal.util.nio.GridNioServer$AbstractNioClientWorker.body(GridNioServer.java:1794)
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
    at java.lang.Thread.run(Unknown Source)
[18:46:12,846][WARNING][grid-nio-worker-tcp-comm-0-#48][TcpCommunicationSpi] Closing NIO session because of unhandled exception [cls=class o.a.i.i.util.nio.GridNioException, msg=Удаленный хост принудительно разорвал существующее подключение]
[18:46:13,861][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=/127.0.0.1:47101, failureDetectionTimeout=10000]
[18:46:14,893][WARNING][tcp-comm-worker-#1][TcpCommunicationSpi] Connect timed out (consider increasing 'failureDetectionTimeout' configuration property) [addr=BB-SRV-DELTA/169.254.40.231:47101, failureDetectionTimeout=10000]
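It looks like the server or some of the clients do not respond to heartbeats or other network requests within 10 seconds. Check the logs of the client nodes as well. You may need to scale out the cluster by adding more servers to spread the load, or tune failureDetectionTimeout.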


The "Blocked system-critical thread has been detected..." error message is harmless, but confusing. I have already restarted it.

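As Denis described, there are a lot of network communication problems.

Typically, a client wants to perform some cache operation, but the server thread in the striped pool stays blocked for a long time. I do not think this is related to the .NET part.

You can see the following message: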

[18:53:04,385][SEVERE][tcp-disco-msg-worker-#2][G] Blocked system-critical thread has been detected. This can lead to cluster-wide undefined behaviour [threadName=sys-stripe-7, blockedFor=13s]
If you look at the thread dump:

Thread [name="sys-stripe-7-#8", id=28, state=WAITING, blockCnt=51, waitCnt=3424]
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(Unknown Source)
        at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:178)
        at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.reserveClient(TcpCommunicationSpi.java:2911)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:2713)
        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:2672)
        at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1656)
        at o.a.i.i.managers.communication.GridIoManager.sendOrderedMessage(GridIoManager.java:1879)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1904)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1875)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendWithRetries(GridContinuousProcessor.java:1857)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.sendNotification(GridContinuousProcessor.java:1275)
        at o.a.i.i.processors.continuous.GridContinuousProcessor.addNotification(GridContinuousProcessor.java:1212)
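The thread is trying to send a continuous query notification but cannot establish a connection to the client node. This blocks the thread, so it cannot serve other cache API requests that need the same partition.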
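The effect can be reproduced in miniature with plain JDK classes. The sketch below is a simplified model, not Ignite's implementation: a single-threaded executor stands in for one stripe, a CompletableFuture stands in for the pending connection to the client, and all names are hypothetical. A task that waits on the connection is queued first, and an unrelated "cache get" queued behind it cannot run until the connection future completes.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of one stripe: a single thread that serves every request mapped to it.
public class StripeBlockingDemo {
    /**
     * Queues a notification task that waits for the client connection, then a cache get
     * on the same stripe. Returns true if the cache get finished within waitMillis.
     */
    public static boolean cacheGetCompletes(CompletableFuture<Void> connection, long waitMillis) {
        ExecutorService stripe = Executors.newSingleThreadExecutor();
        try {
            // Stands in for sending a continuous query notification: parks until connected.
            stripe.submit(() -> { connection.join(); });
            // An unrelated request that happens to map to the same stripe.
            Future<String> cacheGet = stripe.submit(() -> "value");
            try {
                cacheGet.get(waitMillis, TimeUnit.MILLISECONDS);
                return true;
            } catch (TimeoutException e) {
                return false; // still queued behind the blocked notification
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            connection.complete(null); // release the stripe so the demo can exit
            stripe.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Client unreachable: the connection future never completes on its own.
        System.out.println("unreachable client, cache get completed: "
            + cacheGetCompletes(new CompletableFuture<>(), 200));
        // Client reachable: the connection is already established.
        System.out.println("reachable client, cache get completed: "
            + cacheGetCompletes(CompletableFuture.completedFuture(null), 5000));
    }
}
```

In this model the second request has nothing to do with the unreachable client, yet it still times out, which matches clients seeing no results for operations on partitions owned by a blocked stripe.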


As a first step, you can try reducing clientFailureDetectionTimeout, which defaults to 30 seconds. This will not fully solve the network problems, though.

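Regarding the first exception, I have filed this issue:

The second exception is the client restarting Ignite in code. This is a workaround, because when Ignite reconnects after losing the connection it sometimes causes strange behavior on the server, as described here:

I have added some details to the question.

Very interesting comment, I will look into this point more closely. Thanks.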