Spark deleting executors in an Apache Spark DataProc cluster

I frequently run into the following issue while executing Spark jobs: even though the DataProc cluster is sitting idle, the Spark driver kills off the executors until I end up with only one executor.

18/02/26 08:47:05 INFO spark.ExecutorAllocationManager: Existing executor 35 has been removed (new total is 2)
18/02/26 08:50:40 INFO scheduler.TaskSetManager: Finished task 189.0 in stage 5.0 (TID 6002) in 569184 ms on dse-dev-dataproc-w-53.c.k-ddh-lle.internal (executor 57) (499/500)
18/02/26 08:51:40 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 57
18/02/26 08:51:40 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 57
18/02/26 08:51:40 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 57
18/02/26 08:51:40 INFO spark.ExecutorAllocationManager: Removing executor 57 because it has been idle for 60 seconds (new desired total will be 1)
18/02/26 08:51:42 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 57.
18/02/26 08:51:42 INFO scheduler.DAGScheduler: Executor lost: 57 (epoch 7)
18/02/26 08:51:42 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 57 from BlockManagerMaster.
18/02/26 08:51:42 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(57, dse-dev-dataproc-w-53.c.k-ddh-lle.internal, 53072, None)
18/02/26 08:51:42 INFO storage.BlockManagerMaster: Removed 57 successfully in removeExecutor
18/02/26 08:51:42 INFO cluster.YarnScheduler: Executor 57 on dse-dev-dataproc-w-53.c.k-ddh-lle.internal killed by driver.
18/02/26 08:51:42 INFO spark.ExecutorAllocationManager: Existing executor 57 has been removed (new total is 1)
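The "Removing executor 57 because it has been idle for 60 seconds" line is standard dynamic-allocation behaviour: spark.dynamicAllocation.executorIdleTimeout defaults to 60s, so any executor with no running tasks for a minute is handed back to YARN. If executors should survive the gap between stages, that timeout can be raised. A minimal sketch of the relevant properties; the values below are illustrative, not taken from the original job:

spark.dynamicAllocation.executorIdleTimeout 300s
spark.dynamicAllocation.cachedExecutorIdleTimeout   600s
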
Once the tasks in the log above finish and the next stage starts, it looks up the shuffled map output:

18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.206.52.190:42676
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 7 to 10.206.52.157:45812
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 7 to 10.206.52.177:53612
18/02/26 08:52:33 INFO spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.206.52.166:41901
It hangs at this point for a while, and then the tasks fail with the following exception:

18/02/26 09:12:33 INFO BlockManagerMaster: Removal of executor 21 requested
18/02/26 09:12:33 INFO YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 21
18/02/26 00:12:33 INFO BlockManagerMasterEndpoint: Trying to remove executor 21 from BlockManagerMaster.
18/02/26 09:12:33 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1480517110174_0001_01_000049 on host: ip-10-138-114-125.ec2.internal. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1480517110174_0001_01_000049
Exit code: 1
Stack trace: ExitCodeException exitCode=1: 
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
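Note that spark.shuffle.service.enabled is true in the settings below, so shuffle files written by a removed executor are meant to be served by the external shuffle service running inside each YARN NodeManager; losing the executor alone should not lose its map output, and the fetch only hangs like this when the NodeManager or the whole node goes away. For reference, this is the yarn-site.xml wiring that service relies on (Dataproc configures it out of the box; it is shown here only to explain the mechanism):

<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle,spark_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.spark_shuffle.class</name>
  <value>org.apache.spark.network.yarn.YarnShuffleService</value>
</property>
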
My job's executor settings:

spark.driver.maxResultSize  3840m
spark.driver.memory 16G
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.maxExecutors    100
spark.dynamicAllocation.minExecutors    1
spark.executor.cores    6
spark.executor.id   driver
spark.executor.instances    20
spark.executor.memory   30G
spark.hadoop.yarn.timeline-service.enabled  False
spark.shuffle.service.enabled   true
spark.sql.catalogImplementation hive
spark.sql.parquet.cacheMetadata false
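
In the settings above, spark.dynamicAllocation.minExecutors is 1, which is exactly the floor the job shrinks to. One option is to raise that floor (or lengthen the idle timeout) at submit time. A hedged sketch: the cluster name is inferred from the worker hostname in the logs, and the jar and class names are placeholders:

gcloud dataproc jobs submit spark \
    --cluster dse-dev-dataproc \
    --class com.example.MyJob \
    --jars gs://my-bucket/my-job.jar \
    --properties spark.dynamicAllocation.minExecutors=10,spark.dynamicAllocation.executorIdleTimeout=300s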

This looks like a similar issue. You could try setting the YARN minimum container size; that post suggests setting it to 1.5x the requested driver/executor memory.

Are you creating the cluster with preemptible nodes? If so, GCP can revoke them at any time, and always within 24 hours. That would stop your executors, but the cluster should recover from it.

Another possibility is that the data in GCS changes while you are operating on it with Hive/Spark. Check whether there are any concurrent updates to the GCS bucket.
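
If preemptible workers turn out to be the cause, both suggestions can be applied at cluster-creation time. A sketch under stated assumptions: the cluster name, worker count and 4096 MB value are illustrative, and yarn:yarn.scheduler.minimum-allocation-mb is the Dataproc --properties form of the YARN minimum container size:

gcloud dataproc clusters create my-cluster \
    --num-workers 50 \
    --num-preemptible-workers 0 \
    --properties 'yarn:yarn.scheduler.minimum-allocation-mb=4096'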