Spring XD: jobs are automatically undeployed on ZooKeeper timeout in single-node mode
I am running Spring XD in single-node mode. My jobs have job modules that can run for more than 10 minutes. When I launch a job, I see the following timeout, followed by the undeployment of my jobs and modules. Do I need to change any ZooKeeper settings? Please advise.
First error:
2015-01-26 20:30:22,514 [main-SendThread(localhost:2181)] [] INFO org.apache.zookeeper.ClientCnxn - Client session timed out, have not heard from server in 40465ms for sessionid 0x14b28fc8fa80000, closing socket connection and attempting reconnect
2015-01-26 20:30:22,516 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] WARN org.apache.zookeeper.server.NIOServerCnxn - caught end of stream exception
EndOfStreamException: Unable to read additional data from client sessionid 0x14b28fc8fa80000, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:228)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:208)
at java.lang.Thread.run(Unknown Source)
2015-01-26 20:30:22,518 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] INFO org.apache.zookeeper.server.NIOServerCnxn - Closed socket connection for client /127.0.0.1:48599 which had sessionid 0x14b28fc8fa80000
2015-01-26 20:30:22,615 [main-EventThread] [] INFO org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
2015-01-26 20:30:22,616 [DeploymentSupervisorCacheListener-0] [] INFO org.springframework.xd.dirt.server.InitialDeploymentListener - Path cache event: type=CONNECTION_SUSPENDED
2015-01-26 20:30:22,616 [ConnectionStateManager-0] [] INFO org.springframework.xd.dirt.server.DeploymentSupervisor - Admin admin:default,admin,singlenode:9393 connection suspended
2015-01-26 20:30:22,628 [ConnectionStateManager-0] [] INFO org.springframework.xd.dirt.server.ContainerRegistrar - ZooKeeper connection suspended: 9adc3b5e-1b19-4d64-9b52-a5643dc42acb
2015-01-26 20:30:22,649 [LeaderSelector-0] [] INFO org.springframework.xd.dirt.server.DeploymentSupervisor - Leadership canceled due to thread interrupt
2015-01-26 20:30:22,650 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener - Path cache event: type=CONNECTION_SUSPENDED
2015-01-26 20:30:22,678 [DeploymentSupervisorCacheListener-0] [] INFO org.springframework.xd.dirt.server.InitialDeploymentListener - Path cache event: type=CONNECTION_SUSPENDED
2015-01-26 20:30:23,712 [main-SendThread(localhost:2181)] [] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-01-26 20:30:23,713 [main-SendThread(localhost:2181)] [] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to localhost/127.0.0.1:2181, initiating session
2015-01-26 20:30:23,713 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] INFO org.apache.zookeeper.server.NIOServerCnxnFactory - Accepted socket connection from /127.0.0.1:49181
2015-01-26 20:30:23,715 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] INFO org.apache.zookeeper.server.ZooKeeperServer - Client attempting to renew session 0x14b28fc8fa80000 at /127.0.0.1:49181
2015-01-26 20:30:23,715 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181] [] INFO org.apache.zookeeper.server.ZooKeeperServer - Established session 0x14b28fc8fa80000 with negotiated timeout 60000 for client /127.0.0.1:49181
2015-01-26 20:30:23,716 [main-SendThread(localhost:2181)] [] INFO org.apache.zookeeper.ClientCnxn - Session establishment complete on server localhost/127.0.0.1:2181, sessionid = 0x14b28fc8fa80000, negotiated timeout = 60000
2015-01-26 20:30:59,587 [main-EventThread] [] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (localhost:2181) and timeout (15000) / elapsed (36877)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:198)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:88)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:115)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:474)
at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:302)
at org.apache.curator.framework.imps.GetDataBuilderImpl$4.call(GetDataBuilderImpl.java:291)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:107)
at org.apache.curator.framework.imps.GetDataBuilderImpl.pathInForeground(GetDataBuilderImpl.java:287)
at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:279)
at org.apache.curator.framework.imps.GetDataBuilderImpl.forPath(GetDataBuilderImpl.java:41)
at org.springframework.xd.dirt.server.DeploymentListener$JobModuleWatcher.process(DeploymentListener.java:527)
at org.apache.curator.framework.imps.NamespaceWatcher.process(NamespaceWatcher.java:67)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:522)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498)
Then the jobs are undeployed:
2015-01-26 20:31:35,462 [LeaderSelector-1] [] INFO org.springframework.xd.dirt.server.DeploymentSupervisor - Leader Admin admin:default,admin,singlenode:9393 is watching for stream/job deployment requests.
2015-01-26 20:31:35,463 [ConnectionStateManager-0] [] INFO org.springframework.xd.dirt.server.ContainerRegistrar - Waiting for supervisor to clean up prior deployments (elapsed time 0 seconds)...
2015-01-26 20:31:35,463 [ConnectionStateManager-0] [] INFO org.springframework.xd.dirt.server.ContainerRegistrar - Waiting for supervisor to clean up prior deployments (elapsed time 0 seconds)...
2015-01-26 20:31:35,478 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener - Path cache event: path=/deployments/modules/allocated/9adc3b5e-1b19-4d64-9b52-a5643dc42acb/c1_Job.job.custom-mod-hdfs.1, type=CHILD_REMOVED
2015-01-26 20:31:35,478 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener - Undeploying module [ModuleDescriptor@593edefc moduleName = 'custom-mod-hdfs', moduleLabel = 'custom-mod-hdfs', group = 'c1_Job', sourceChannelName = [null], sinkChannelName = [null], sinkChannelName = [null], index = 0, type = job, parameters = map['table' -> 'c1', 'mode' -> 'initial'], children = list[[empty]]]
2015-01-26 20:31:35,516 [main-EventThread] [] INFO org.springframework.xd.dirt.server.DeploymentListener - Undeploying module [ModuleDescriptor@4a83d881 moduleName = 'custom-mod-hdfs', moduleLabel = 'custom-mod-hdfs', group = 'c2_Job', sourceChannelName = [null], sinkChannelName = [null], sinkChannelName = [null], index = 0, type = job, parameters = map['fetchSize' -> '100000', 'table' -> 'counts', 'mode' -> 'initial'], children = list[[empty]]]
2015-01-26 20:31:35,562 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener - Path cache event: path=/deployments/modules/allocated/9adc3b5e-1b19-4d64-9b52-a5643dc42acb/c2_job.job.custom-mod-hdfs.1, type=CHILD_REMOVED
2015-01-26 20:31:35,562 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener - Undeploying module [ModuleDescriptor@6d07b3ed moduleName = 'custom-mod-hdfs', moduleLabel = 'custom-mod-hdfs', group = 'c2_Job', sourceChannelName = [null], sinkChannelName = [null], sinkChannelName = [null], index = 0, type = job, parameters = map['table' -> 'c2', 'mode' -> 'initial'], children = list[[empty]]]
2015-01-26 20:31:35,577 [DeploymentsPathChildrenCache-0] [] INFO org.springframework.xd.dirt.server.DeploymentListener - Path cache event: path=/deployments/modules/allocated/9adc3b5e-1b19-4d64-9b52-a5643dc42acb/c3_job.custom-mod-hdfs.1, type=CHILD_REMOVED
2015-01-26 20:31:35,578 [main-EventThread] [] INFO org.springframework.xd.dirt.server.DeploymentListener - Undeploying module [ModuleDescriptor@13097137 moduleName = 'custom-mod-hdfs', moduleLabel = 'custom-mod-hdfs', group = 'c3_Job', sourceChannelName = [null], sinkChannelName = [null], sinkChannelName = [null], index = 0, type = job, parameters = map['table' -> 'c3', 'mode' -> 'initial'], children = list[[empty]]]
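Spring XD single-node mode runs an embedded ZooKeeper server in the same JVM as the ZooKeeper "client", i.e. the application context that hosts the job modules. A 40-second gap between heartbeats likely indicates that the JVM is in heavy GC and/or that the host has run out of physical memory and is swapping to disk.
To verify this theory, I recommend enabling verbose GC. This can be done by modifying the xd-singlenode script, or by setting the environment variable export JAVA_OPTS=-verbose:gc before launching the script.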
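Besides verbose GC output, a stall of this kind can also be observed from inside a JVM. The following is only a minimal sketch of that idea, not part of Spring XD: the class and method names are hypothetical. It sleeps for a fixed interval in a loop and records how much longer than requested each sleep took. Large overshoots suggest the whole JVM was paused (GC) or the host was starved (swapping).

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Spring XD or ZooKeeper.
public class PauseDetector {

    /**
     * Sleeps intervalMs for the given number of iterations and returns the
     * largest overshoot (in ms) beyond the requested sleep time.
     */
    static long maxStallMillis(long intervalMs, int iterations) {
        long worst = 0L;
        for (int i = 0; i < iterations; i++) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                // Restore the interrupt flag and stop sampling early.
                Thread.currentThread().interrupt();
                break;
            }
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            worst = Math.max(worst, Math.max(0L, elapsedMs - intervalMs));
        }
        return worst;
    }

    public static void main(String[] args) {
        // 50 samples of 100 ms each, i.e. roughly 5 seconds of observation.
        long worst = maxStallMillis(100, 50);
        System.out.println("Largest stall observed: " + worst + " ms");
    }
}
```

On a healthy machine the reported stall should be a few milliseconds at most. A value in the seconds range would be consistent with the theory above, although it does not by itself distinguish GC from swapping.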
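To change the session and connection timeouts, you can set the following JVM system properties:
- Session timeout: curator-default-session-timeout
- Connection timeout: curator-default-connection-timeout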
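To check that the properties actually reach the JVM, a small sketch like the one below can help. It assumes that Curator resolves these two names as integer system properties and falls back to 60000 ms and 15000 ms when they are unset. Those fallback values are taken from the logs in the question ("negotiated timeout 60000" and "timeout (15000)"). The class name is hypothetical.

```java
// Hypothetical helper that mirrors how integer system properties are resolved.
public class CuratorTimeoutProps {

    static final String SESSION_PROP = "curator-default-session-timeout";
    static final String CONNECTION_PROP = "curator-default-connection-timeout";

    /** Session timeout in ms; 60000 is the value seen in the question's logs. */
    static int sessionTimeoutMs() {
        return Integer.getInteger(SESSION_PROP, 60_000);
    }

    /** Connection timeout in ms; 15000 is the value seen in the question's logs. */
    static int connectionTimeoutMs() {
        return Integer.getInteger(CONNECTION_PROP, 15_000);
    }

    public static void main(String[] args) {
        System.out.println(SESSION_PROP + " = " + sessionTimeoutMs() + " ms");
        System.out.println(CONNECTION_PROP + " = " + connectionTimeoutMs() + " ms");
    }
}
```

Running it with, for example, java -Dcurator-default-session-timeout=120000 CuratorTimeoutProps should print the overridden session timeout (120000 is only an illustrative value). The same -D flags can be appended to JAVA_OPTS before launching xd-singlenode. Note that longer timeouts only hide the pauses; they do not remove their cause.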
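This problem gave us a lot of headaches, so I want to share what we did to resolve it. Some of you may have a similar setup, and I hope this saves you some time tracking down your own issue.
Especially if you see the following log: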
09:48:21,467 1.1.1.RELEASE WARN SyncThread:0 persistence.FileTxnLog - fsync-ing the write ahead log in SyncThread:0 took 1123ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
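you may be in the same trouble we were (only an educated guess, but this line made us suspect the system was in a very bad state).
In fact, the GC pauses were not caused by our application alone. We run XD in single-node mode (as a starting option until we need to scale) with a MySQL database on the same server. We mainly use XD to schedule jobs, which are triggered by external events coming into the system (and are therefore somewhat unpredictable).
In the end, our problems were:
- expensive database queries
- database results being loaded into memory by the MySQL JDBC driver
- Java garbage collection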
Tracking this down was not easy. First, we had to get the system into trouble somehow, which we did by firing a large number of events at it. We used htop and saw that memory was fine and CPU was fine, yet it told us the system was unusually busy. That left IO, so we used iotop to check whether MySQL was the culprit. We traced the queries with MySQL's SHOW FULL PROCESSLIST command. The next step was to limit the heap available to the XD container so that a memory dump (created with jmap) would fit on our developer machines, and to use a profiler (YourKit in our case) to track down the memory problem. In our case it was MySQL, and we solved the problem by switching the affected queries to cursor fetch mode.
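For readers unfamiliar with cursor fetch mode, here is a minimal JDBC sketch of what such a switch can look like. It is not the poster's actual code. It assumes MySQL Connector/J, where cursor-based fetching is enabled with useCursorFetch=true in the JDBC URL together with a positive fetch size on the statement. The class name, host, database, and query are all hypothetical.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical example; requires MySQL Connector/J on the classpath to connect.
public class CursorFetchExample {

    /** Builds a JDBC URL that asks Connector/J to use server-side cursors. */
    static String cursorFetchUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database + "?useCursorFetch=true";
    }

    /** Iterates over a query result in batches of fetchSize rows and counts them. */
    static long countRows(Connection conn, String sql, int fetchSize) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                sql, ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {
            // With useCursorFetch=true, rows are fetched fetchSize at a time
            // instead of the whole result being buffered in the JVM heap.
            ps.setFetchSize(fetchSize);
            long rows = 0;
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    rows++;
                }
            }
            return rows;
        }
    }

    public static void main(String[] args) throws SQLException {
        String url = cursorFetchUrl("localhost", 3306, "mydb"); // hypothetical database
        if (args.length < 2) {
            // Without credentials, only show the URL that would be used.
            System.out.println(url);
            return;
        }
        try (Connection conn = DriverManager.getConnection(url, args[0], args[1])) {
            // "big_table" is a placeholder for one of the expensive queries.
            System.out.println(countRows(conn, "SELECT * FROM big_table", 1000));
        }
    }
}
```

The trade-off is more round trips to the database in exchange for a bounded memory footprint per query, which in turn reduces GC pressure in the JVM that also hosts the embedded ZooKeeper server.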