Java DNS resolution hangs forever

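I am using the Curator framework to connect to a ZooKeeper server, but I have run into a strange DNS resolution problem. Here is the jstack dump of the thread: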

#21 prio=5 os_prio=0 tid=0x0000000001888800 nid=0x3a46 runnable [0x00007f25e69f3000]
java.lang.Thread.State: RUNNABLE
    at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
    at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
    at java.net.InetAddress.getAllByName(InetAddress.java:1192)
    at java.net.InetAddress.getAllByName(InetAddress.java:1126)
    at org.apache.zookeeper.client.StaticHostProvider.resolveAndShuffle(StaticHostProvider.java:117)
    at org.apache.zookeeper.client.StaticHostProvider.<init>(StaticHostProvider.java:81)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1096)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:1006)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:804)
    at org.apache.zookeeper.ZooKeeper.<init>(ZooKeeper.java:679)
    at com.netflix.curator.HandleHolder$1.getZooKeeper(HandleHolder.java:72)
    - locked <0x00000000fd761f40> (a com.netflix.curator.HandleHolder$1)
    at com.netflix.curator.HandleHolder.getZooKeeper(HandleHolder.java:46)
    at com.netflix.curator.ConnectionState.reset(ConnectionState.java:122)
    at com.netflix.curator.ConnectionState.start(ConnectionState.java:95)
    at com.netflix.curator.CuratorZookeeperClient.start(CuratorZookeeperClient.java:137)
    at com.netflix.curator.framework.imps.CuratorFrameworkImpl.start(CuratorFrameworkImpl.java:167)

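The thread appears to be stuck in the native method and never returns. It also happens very randomly, so I cannot reproduce it consistently. Any ideas?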

We are struggling with this problem too. It looks like it is caused by either a glibc bug or a kernel bug, depending on whom you ask ;)

To confirm that this is what you are hitting, attach gdb to the Java process:

gdb --pid <JavaProcessPid>

Find the thread that is executing recvmsg:

thread <HangingThreadId>

If you see a backtrace like the following, then a glibc/kernel upgrade should help:

#0  0x00007fc726ff27cd in recvmsg () from /lib64/libc.so.6
#1  0x00007fc727018765 in make_request () from /lib64/libc.so.6
#2  0x00007fc727018b9a in __check_pf () from /lib64/libc.so.6
#3  0x00007fc726fdbd57 in getaddrinfo () from /lib64/libc.so.6
#4  0x00007fc706dd9635 in Java_java_net_Inet6AddressImpl_lookupAllHostAddr () from /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-0.b17.el6_7.x86_64/jre/lib/amd64/libnet.so
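
Update: it looks like the kernel wins. See this thread for details. There is also a tool for checking whether your system is affected by the kernel bug. You can use the following simple program.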

To verify:

curl -o pf_dump.c https://gist.githubusercontent.com/stevenschlansker/6ad46c5ccb22bc4f3473/raw/22cfe72f6708de1e3468c1e0fa3888aafae42db4/pf_dump.c
gcc pf_dump.c -pthread -o pf_dump
./pf_dump

If the output is:

[26170] glibc: check_pf: netlink socket read timeout
Aborted

then the system is affected. If the output looks like:

exit success [7618] exit success [7265] exit success

then the system is fine.

In an AWS environment, upgrading the AMIs to (2016.3.2), which ships a newer kernel, appears to have solved the problem.
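
Because the hang is random, it can help to spot it from inside the JVM before reaching for gdb. The following is only a minimal sketch, not part of Curator or ZooKeeper: the class name `DnsHangDetector` and its methods are hypothetical, and it simply scans live thread stacks for the `lookupAllHostAddr` frame that appears at the top of the jstack dump in the question. Frame names can differ between JDK versions, so treat the match as an assumption to adjust.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class DnsHangDetector {

    /** Returns true if any frame of the stack is a java.net lookupAllHostAddr call. */
    static boolean isInDnsLookup(StackTraceElement[] stack) {
        for (StackTraceElement frame : stack) {
            if ("lookupAllHostAddr".equals(frame.getMethodName())
                    && frame.getClassName().startsWith("java.net.")) {
                return true;
            }
        }
        return false;
    }

    /** Names of live threads that are currently inside a DNS lookup. */
    static List<String> threadsInDnsLookup() {
        List<String> names = new ArrayList<>();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            if (isInDnsLookup(e.getValue())) {
                names.add(e.getKey().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // A single snapshot; a thread is only suspicious if it shows up repeatedly.
        System.out.println("Threads in DNS lookup: " + threadsInDnsLookup());
    }
}
```

A thread that appears in this list across several samples taken seconds apart is a candidate for the gdb check described above. A single hit only means a lookup was in progress at that moment.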

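Not sure whether this is a DNS problem. Please check this question:

I am currently hitting the same problem at random times. We have -Djava.net.preferIPv4Stack=true defined and are running on RedHat servers. Can we define a timeout for the DNS resolution call?

Please don't post link-only answers. Either post it as a comment or include the important parts in the text.

Yes, a glibc upgrade solved the problem! I forgot to update this post. Thanks @Jacek Tomaka.

I think curl -O should be curl -o.

For those on an AMI (for example, I use 2014.03) who can reproduce the issue while the script above still says "exit success": this means that, on an AMI, checking with the gist script is not reliable.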
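
On the timeout question raised in the comments: as far as I know, `InetAddress.getAllByName` takes no per-call timeout argument, so one common workaround is to run the lookup on a separate thread and bound how long the caller waits. The sketch below assumes that approach. The class `TimedDnsLookup` and its method names are hypothetical, and it does not change how ZooKeeper's `StaticHostProvider` resolves hosts internally. It only lets your own code fail fast, for example as a pre-check before starting Curator.

```java
import java.net.InetAddress;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedDnsLookup {

    // Daemon threads, so a lookup stuck in native code cannot keep the JVM alive.
    private static final ExecutorService POOL = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "timed-dns-lookup");
        t.setDaemon(true);
        return t;
    });

    /** Runs the task on a pool thread and waits at most timeoutMillis for its result. */
    static <T> T callWithTimeout(Callable<T> task, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        Future<T> future = POOL.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Best effort only: a thread blocked in native getaddrinfo ignores interrupts.
            future.cancel(true);
            throw e;
        }
    }

    /** Resolves host, giving up (from the caller's point of view) after timeoutMillis. */
    static InetAddress[] resolve(String host, long timeoutMillis)
            throws TimeoutException, ExecutionException, InterruptedException {
        return callWithTimeout(() -> InetAddress.getAllByName(host), timeoutMillis);
    }

    public static void main(String[] args) throws Exception {
        for (InetAddress address : resolve("localhost", 2000)) {
            System.out.println(address);
        }
    }
}
```

Note the limitation: the timeout frees the calling thread, but a worker thread that is stuck inside the native lookup stays stuck and is leaked until the process exits. This contains the symptom; the glibc/kernel upgrade discussed in the answer is still what removes the cause.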