Spark Standalone on Kubernetes - application finished after consecutive master then driver failure

Tags: apache-spark, kubernetes, spark-streaming, apache-spark-standalone, kubernetes-statefulset

I am trying to achieve high availability of the Spark master using ZooKeeper, and Spark driver resiliency using metadata checkpointing to GlusterFS.

Some details:

  • Using Spark 2.2.0 (pre-built binaries)
  • Submitting a streaming application from a separate Spark client, with --deploy-mode cluster and --supervise (the full command is sketched after this list)
  • The Spark components on Kubernetes are of type StatefulSet, for dynamic volume provisioning (previously used Replication Controller/Deployment)
  • Created 3 GlusterFS shared PVCs - spark-master-pvc, spark-worker-pvc, spark-ckp-pvc
I have successfully achieved the following scenarios: master-only failure, driver-only failure, consecutive master and driver failure, driver failure then master failure. But a scenario like submit the job -> master fails (works fine) -> driver fails, i.e. the worker pod fails, is not working.

The new master's log -

18/06/11 10:23:16 INFO ZooKeeperLeaderElectionAgent: We have gained leadership
18/06/11 10:23:16 INFO Master: I have been elected leader! New state: RECOVERING
18/06/11 10:23:16 INFO Master: Trying to recover app: app-20180611102123-0001
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611101834-10.1.53.142-36203
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611102123-10.1.170.85-39447
18/06/11 10:23:16 INFO Master: Trying to recover worker: worker-20180611101834-10.1.185.87-38235
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.53.142:36203 after 7 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.185.87:38235 after 3 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.53.142:38994 after 12 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO TransportClientFactory: Successfully created connection to /10.1.170.85:39447 after 7 ms (0 ms spent in bootstraps)
18/06/11 10:23:16 INFO Master: Application has been re-registered: app-20180611102123-0001
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611102123-10.1.170.85-39447
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611101834-10.1.53.142-36203
18/06/11 10:23:16 INFO Master: Worker has been re-registered: worker-20180611101834-10.1.185.87-38235
18/06/11 10:23:16 INFO Master: Recovery complete - resuming operations!
18/06/11 10:24:37 INFO Master: Received unregister request from application app-20180611102123-0001
18/06/11 10:24:37 INFO Master: Removing app app-20180611102123-0001
18/06/11 10:24:37 INFO Master: 10.1.53.142:38994 got disassociated, removing it.
18/06/11 10:24:37 INFO Master: 10.1.53.142:38994 got disassociated, removing it.
18/06/11 10:24:37 WARN Master: Got status update for unknown executor app-20180611102123-0001/0
18/06/11 10:24:37 WARN Master: Got status update for unknown executor app-20180611102123-0001/1
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: Removing worker worker-20180611101834-10.1.53.142-36203 on 10.1.53.142:36203
18/06/11 10:24:38 INFO Master: Re-launching driver-20180611102017-0000
18/06/11 10:24:38 INFO Master: Launching driver driver-20180611102017-0000 on worker worker-20180611101834-10.1.185.87-38235
18/06/11 10:24:38 INFO Master: 10.1.53.142:59142 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:38 INFO Master: 10.1.53.142:36203 got disassociated, removing it.
18/06/11 10:24:43 INFO Master: Registering worker 10.1.53.143:35156 with 8 cores, 30.3 GB RAM
The driver remains in STOPPED state. The driver's error log -

log4j:WARN No appenders could be found for logger (org.apache.hadoop.metrics2.lib.MutableMetricsFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/06/11 19:32:14 INFO SecurityManager: Changing view acls to: root
18/06/11 19:32:14 INFO SecurityManager: Changing modify acls to: root
18/06/11 19:32:14 INFO SecurityManager: Changing view acls groups to: 
18/06/11 19:32:14 INFO SecurityManager: Changing modify acls groups to: 
18/06/11 19:32:14 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
18/06/11 19:32:15 INFO Utils: Successfully started service 'Driver' on port 40594.
18/06/11 19:32:15 INFO WorkerWatcher: Connecting to worker spark://Worker@10.1.185.87:38235
18/06/11 19:32:15 INFO TransportClientFactory: Successfully created connection to /10.1.185.87:38235 after 44 ms (0 ms spent in bootstraps)
18/06/11 19:32:15 INFO WorkerWatcher: Successfully connected to spark://Worker@10.1.185.87:38235
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint files found: file:/ckp/checkpoint-1528712675000,file:/ckp/checkpoint-1528712675000.bk,file:/ckp/checkpoint-1528712670000,file:/ckp/checkpoint-1528712670000.bk,file:/ckp/checkpoint-1528712665000,file:/ckp/checkpoint-1528712665000.bk,file:/ckp/checkpoint-1528712660000,file:/ckp/checkpoint-1528712660000.bk,file:/ckp/checkpoint-1528712655000,file:/ckp/checkpoint-1528712655000.bk
18/06/11 19:32:15 INFO CheckpointReader: Attempting to load checkpoint from file file:/ckp/checkpoint-1528712675000
18/06/11 19:32:15 INFO Checkpoint: Checkpoint for time 1528712675000 ms validated
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint successfully loaded from file file:/ckp/checkpoint-1528712675000
18/06/11 19:32:15 INFO CheckpointReader: Checkpoint was generated at time 1528712675000 ms
18/06/11 19:32:15 INFO SparkContext: Running Spark version 2.2.0
18/06/11 19:32:15 INFO SparkContext: Submitted application: SparkStreamingWithCheckPointAndZK
18/06/11 19:32:15 INFO SecurityManager: Changing view acls to: root
18/06/11 19:32:15 INFO SecurityManager: Changing modify acls to: root
18/06/11 19:32:15 INFO SecurityManager: Changing view acls groups to: 
18/06/11 19:32:15 INFO SecurityManager: Changing modify acls groups to: 
18/06/11 19:32:15 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(root); groups with view permissions: Set(); users  with modify permissions: Set(root); groups with modify permissions: Set()
18/06/11 19:32:15 INFO Utils: Successfully started service 'sparkDriver' on port 46544.
18/06/11 19:32:15 INFO SparkEnv: Registering MapOutputTracker
18/06/11 19:32:15 INFO SparkEnv: Registering BlockManagerMaster
18/06/11 19:32:15 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
18/06/11 19:32:15 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
18/06/11 19:32:16 INFO DiskBlockManager: Created local directory at /tmp/blockmgr-623c4b9e-8045-4a19-a746-96a3b23c1184
18/06/11 19:32:16 INFO MemoryStore: MemoryStore started with capacity 366.3 MB
18/06/11 19:32:16 INFO SparkEnv: Registering OutputCommitCoordinator
18/06/11 19:32:16 INFO Utils: Successfully started service 'SparkUI' on port 4040.
18/06/11 19:32:16 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://10.1.185.87:4040
18/06/11 19:32:16 INFO SparkContext: Added JAR file:///opt/spark/jars/spark-0.0.1-SNAPSHOT.jar at spark://10.1.185.87:46544/jars/spark-0.0.1-SNAPSHOT.jar with timestamp 1528745536460
18/06/11 19:32:16 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:32:36 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:32:56 INFO StandaloneAppClient$ClientEndpoint: Connecting to master spark://10.1.170.81:7077...
18/06/11 19:33:16 ERROR StandaloneSchedulerBackend: Application has been killed. Reason: All masters are unresponsive! Giving up.
18/06/11 19:33:16 WARN StandaloneSchedulerBackend: Application ID is not initialized yet.
18/06/11 19:33:16 INFO SparkUI: Stopped Spark web UI at http://10.1.185.87:4040
18/06/11 19:33:16 INFO StandaloneSchedulerBackend: Shutting down all executors
18/06/11 19:33:16 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46323.
18/06/11 19:33:16 INFO NettyBlockTransferService: Server created on 10.1.185.87:46323
18/06/11 19:33:16 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/06/11 19:33:16 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
18/06/11 19:33:16 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 WARN StandaloneAppClient$ClientEndpoint: Drop UnregisterApplication(null) because has not yet connected to master
18/06/11 19:33:16 INFO BlockManagerMasterEndpoint: Registering block manager 10.1.185.87:46323 with 366.3 MB RAM, BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.1.185.87, 46323, None)
18/06/11 19:33:16 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/06/11 19:33:16 INFO MemoryStore: MemoryStore cleared
18/06/11 19:33:16 INFO BlockManager: BlockManager stopped
18/06/11 19:33:16 INFO BlockManagerMaster: BlockManagerMaster stopped
18/06/11 19:33:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/06/11 19:33:16 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: requirement failed: Can only call getServletHandlers on a running MetricsSystem
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:141)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:829)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:626)
at org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
at org.merlin.spark.SparkKafkaStreamingWithGluster.main(SparkKafkaStreamingWithGluster.java:42)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:58)
at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
18/06/11 19:33:16 INFO SparkContext: SparkContext already stopped.

Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.metrics.MetricsSystem.getServletHandlers(MetricsSystem.scala:91)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:524)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2509)
at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:141)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at org.apache.spark.streaming.StreamingContext$$anonfun$getOrCreate$1.apply(StreamingContext.scala:829)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.StreamingContext$.getOrCreate(StreamingContext.scala:829)
at org.apache.spark.streaming.api.java.JavaStreamingContext$.getOrCreate(JavaStreamingContext.scala:626)
at org.apache.spark.streaming.api.java.JavaStreamingContext.getOrCreate(JavaStreamingContext.scala)
at org.merlin.spark.SparkKafkaStreamingWithGluster.main(SparkKafkaStreamingWithGluster.java:42)
... 6 more