Apache Spark stateful streaming job hangs on checkpointing to S3 after long uptime

Tags: apache-spark, amazon-s3, spark-streaming

I have recently been stress testing our Spark Streaming application. The stress test pushes roughly 20,000 messages per second, each between 200 bytes and 1 KB in size, into Kafka, and Spark Streaming reads them in batches every 4 seconds.

Our Spark cluster runs version 1.6.1 with the standalone cluster manager, and our code is on Scala 2.10.6.
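
For reference, here is a minimal sketch of the kind of job described above (the broker, topic, key derivation, and S3 path are illustrative assumptions, not our actual code):

// Minimal sketch of a stateful streaming job checkpointing to S3 (illustrative).
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("stress-test")
val ssc = new StreamingContext(conf, Seconds(4))    // 4-second batches
ssc.checkpoint("s3n://some-bucket/checkpoints")     // checkpoint directory on S3

val kafkaParams = Map("metadata.broker.list" -> "broker:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("messages"))

// A stateful transformation: its state RDDs are what gets checkpointed to S3.
val counts = stream
  .map { case (_, value) => (value.take(8), 1L) }   // illustrative key derivation
  .updateStateByKey[Long]((values, state) => Some(state.getOrElse(0L) + values.sum))
counts.checkpoint(Seconds(40))                      // 40-second checkpoint interval

ssc.start()
ssc.awaitTermination()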

After roughly 15-20 hours of uptime, one of the executors performing the checkpointing (done at 40-second intervals) gets stuck with the following stack trace and never completes:

java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
java.net.SocketInputStream.read(SocketInputStream.java:170)
java.net.SocketInputStream.read(SocketInputStream.java:141)
sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
sun.security.ssl.InputRecord.readV3Record(InputRecord.java:593)
sun.security.ssl.InputRecord.read(InputRecord.java:532)
sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:533)
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:401)
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
org.apache.http.impl.conn.AbstractPoolEntry.open(AbstractPoolEntry.java:144)
org.apache.http.impl.conn.AbstractPooledConnAdapter.open(AbstractPooledConnAdapter.java:131)
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:326)
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRequest(RestStorageService.java:277)
org.jets3t.service.impl.rest.httpclient.RestStorageService.performRestHead(RestStorageService.java:1038)
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectImpl(RestStorageService.java:2250)
org.jets3t.service.impl.rest.httpclient.RestStorageService.getObjectDetailsImpl(RestStorageService.java:2179)
org.jets3t.service.StorageService.getObjectDetails(StorageService.java:1120)
org.jets3t.service.StorageService.getObjectDetails(StorageService.java:575)
org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:174)
sun.reflect.GeneratedMethodAccessor32.invoke(Unknown Source)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:497)
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
org.apache.hadoop.fs.s3native.$Proxy18.retrieveMetadata(Unknown Source)
org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:472)
org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1424)
org.apache.spark.rdd.ReliableCheckpointRDD$.writePartitionToCheckpointFile(ReliableCheckpointRDD.scala:168)
org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$writeRDDToCheckpointDirectory$1.apply(ReliableCheckpointRDD.scala:136)
org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$writeRDDToCheckpointDirectory$1.apply(ReliableCheckpointRDD.scala:136)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
org.apache.spark.scheduler.Task.run(Task.scala:89)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)

While it is stuck, the Spark driver refuses to keep processing incoming batches and builds up a large backlog of queued batches, which cannot be processed until the stuck task is released.
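
As a side note, this kind of backlog can be detected programmatically; here is a sketch using Spark 1.6's StreamingListener API (the threshold and log output are illustrative):

// Sketch: warn when batches pile up behind a stuck one (threshold illustrative).
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted, StreamingListenerBatchSubmitted}

class BacklogMonitor extends StreamingListener {
  private val queued = new AtomicInteger(0)

  override def onBatchSubmitted(submitted: StreamingListenerBatchSubmitted): Unit = {
    val n = queued.incrementAndGet()
    if (n > 10) println(s"WARN: $n batches queued; a task may be hung")
  }

  override def onBatchCompleted(completed: StreamingListenerBatchCompleted): Unit = {
    queued.decrementAndGet()
  }
}

// Register it on the streaming context:
// ssc.addStreamingListener(new BacklogMonitor)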

Furthermore, looking at the driver's thread dump, the streaming-job-executor-0 thread is clearly waiting for this task to complete:

java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:502)
org.apache.spark.scheduler.JobWaiter.awaitResult(JobWaiter.scala:73)
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:612)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
org.apache.spark.SparkContext.runJob(SparkContext.scala:1922)
org.apache.spark.rdd.ReliableCheckpointRDD$.writeRDDToCheckpointDirectory(ReliableCheckpointRDD.scala:135)
org.apache.spark.rdd.ReliableRDDCheckpointData.doCheckpoint(ReliableRDDCheckpointData.scala:58)
org.apache.spark.rdd.RDDCheckpointData.checkpoint(RDDCheckpointData.scala:74)
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply$mcV$sp(RDD.scala:1682)
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1679)
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1.apply(RDD.scala:1679)
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
org.apache.spark.rdd.RDD.doCheckpoint(RDD.scala:1678)
org.apache.spark.rdd.RDD$$anonfun$doCheckpoint$1$$anonfun$apply$mcV$sp$1.apply(RDD.scala:1684)
org.apac…

One mitigation is to enable speculative execution when submitting the job, so that a duplicate of a stuck checkpoint task is launched on another executor and the batch can still complete:

spark-submit \
  --conf "spark.speculation=true" \
  --conf "spark.speculation.multiplier=5" \
  ...

The stack trace above shows the socket read blocked inside the SSL handshake (performInitialHandshake), and older Apache HttpClient releases are known not to honor the socket timeout during that handshake, so the call can block indefinitely. Another workaround is therefore to put a newer httpcomponents-client (4.5.2) at the front of the standalone daemons' classpath so its jars shadow the older bundled version:

# Build a classpath string from the newer httpcomponents-client jars
CP=''; for f in /path/to/httpcomponents-client-4.5.2/lib/*.jar; do CP=$CP$f:; done

# Start the standalone daemons with that classpath prepended
SPARK_CLASSPATH="$CP" sbin/start-master.sh                            # on the master machine
SPARK_CLASSPATH="$CP" sbin/start-slave.sh 'spark://master_name:7077'  # on each worker machine