Apache Spark in YARN mode fails with exit status: -100. Diagnostics: Container released on a *lost* node


I am trying to load a database containing 1 TB of data into Spark on AWS, using the latest EMR release. The running time is so long that it does not even finish within 6 hours, and after running for about 6h30m I get errors announcing that a container was released on a *lost* node, and then the job fails. The logs look like this:

16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144178.0 in stage 0.0 (TID 144178, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144181.0 in stage 0.0 (TID 144181, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144175.0 in stage 0.0 (TID 144175, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144213.0 in stage 0.0 (TID 144213, ip-10-0-2-176.ec2.internal): ExecutorLostFailure (executor 5 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000006 on host: ip-10-0-2-176.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 5 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(5, ip-10-0-2-176.ec2.internal, 43922)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 5 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 6 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 5 has been removed (new total is 41)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144138.0 in stage 0.0 (TID 144138, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144185.0 in stage 0.0 (TID 144185, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144184.0 in stage 0.0 (TID 144184, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144186.0 in stage 0.0 (TID 144186, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 6 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000007 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 6 (epoch 0)
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 6 from BlockManagerMaster.
16/07/01 22:45:43 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(6, ip-10-0-2-173.ec2.internal, 43593)
16/07/01 22:45:43 INFO storage.BlockManagerMaster: Removed 6 successfully in removeExecutor
16/07/01 22:45:43 ERROR cluster.YarnClusterScheduler: Lost executor 30 on ip-10-0-2-173.ec2.internal: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144162.0 in stage 0.0 (TID 144162, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO spark.ExecutorAllocationManager: Existing executor 6 has been removed (new total is 40)
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144156.0 in stage 0.0 (TID 144156, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144170.0 in stage 0.0 (TID 144170, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 WARN scheduler.TaskSetManager: Lost task 144169.0 in stage 0.0 (TID 144169, ip-10-0-2-173.ec2.internal): ExecutorLostFailure (executor 30 exited caused by one of the running tasks) Reason: Container marked as failed: container_1467389397754_0001_01_000035 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
16/07/01 22:45:43 INFO scheduler.DAGScheduler: Executor lost: 30 (epoch 0)
16/07/01 22:45:43 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1467389397754_0001_01_000024 on host: ip-10-0-2-173.ec2.internal. Exit status: -100. Diagnostics: Container released on a *lost* node
I am quite sure my network setup works, because I have tried running the same script on a much smaller table in the same environment.


Also, I know someone posted a question asking the same thing 6 months ago, but I still have to ask, since nobody answered it.

It looks like other people are having the same problem, so I am posting an answer instead of writing a comment. I am not sure this will solve the problem, but it should give you an idea.

If you use spot instances, you should know that a spot instance is shut down whenever the market price rises above your bid, and you will then hit this problem, even if you only use spot instances as slave (task) nodes. So my solution was not to use any spot instances for long-running jobs.
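Below is a minimal boto3 sketch of that approach, assuming the cluster is launched programmatically: every instance group is requested as ON_DEMAND, so nodes are not reclaimed mid-job. The cluster name, region, instance types and counts are hypothetical placeholders, not values from the question.

import boto3

emr = boto3.client("emr", region_name="us-east-1")   # hypothetical region

response = emr.run_job_flow(
    Name="long-running-spark-job",                   # hypothetical cluster name
    ReleaseLabel="emr-5.0.0",                        # use whichever release you actually run
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {
                "Name": "Master",
                "Market": "ON_DEMAND",               # never spot for the master
                "InstanceRole": "MASTER",
                "InstanceType": "m4.xlarge",
                "InstanceCount": 1,
            },
            {
                "Name": "Core",
                "Market": "ON_DEMAND",               # on-demand core nodes are not reclaimed
                "InstanceRole": "CORE",
                "InstanceType": "m4.2xlarge",
                "InstanceCount": 10,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    JobFlowRole="EMR_EC2_DefaultRole",               # default EMR roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])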


Another idea is to split the job into many independent steps, so that the result of each step is saved as a file on S3. If any error occurs, you only restart that step from the cached files, as sketched below.
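A minimal PySpark sketch of that idea, assuming the input is already laid out in independent chunks; the bucket, paths, chunk count and the process() function are hypothetical placeholders.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chunked-job").getOrCreate()

in_prefix = "s3://my-bucket/input"            # hypothetical chunked input layout
out_prefix = "s3://my-bucket/intermediate"    # hypothetical prefix for per-step results

def process(df):
    # placeholder for the real per-chunk transformation
    return df

for i in range(100):                          # hypothetical number of chunks
    out_path = f"{out_prefix}/chunk={i}"
    try:
        # If a previous run already wrote this chunk, skip it on restart.
        spark.read.parquet(out_path)
        continue
    except Exception:
        pass
    chunk = spark.read.parquet(f"{in_prefix}/chunk={i}")
    process(chunk).write.mode("overwrite").parquet(out_path)

# Final pass: read all intermediate results back from S3 and combine them.
result = spark.read.parquet(out_prefix)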

Is it allocating memory dynamically? I had a similar problem and solved it by switching to static allocation: I computed the executor memory, executor cores and number of executors myself.
Try static allocation in Spark for huge workloads.
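A minimal sketch of what static allocation looks like when configured on the session builder; the executor count and sizes below are illustrative only and must be recomputed for your own cluster.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("static-allocation-job")
    .config("spark.dynamicAllocation.enabled", "false")  # turn dynamic allocation off
    .config("spark.executor.instances", "30")            # e.g. 3 executors per node x 10 nodes
    .config("spark.executor.cores", "5")                 # leave a core per node for OS/daemons
    .config("spark.executor.memory", "18g")              # per-executor heap, minus YARN overhead
    .getOrCreate()
)

The same settings can also be passed on the command line via --num-executors, --executor-cores and --executor-memory when submitting the job.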

This means your YARN container has been shut down. To debug what happened you have to read the YARN logs, using the official CLI:
yarn logs -applicationId
or feel free to use the YARN viewer as a web application and contribute to my project.


You should see plenty of worker errors there.

I ran into the same issue. I found some clues in this article on DZone:

The problem was solved by increasing the number of DataFrame partitions (in this case, from 1,024 to 2,048), which reduced the memory needed per partition.



So I tried increasing the number of DataFrame partitions, and that solved my problem.
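A minimal sketch of that fix; the input path and partition count are placeholders for whatever your job actually uses.

# Read the input and spread it over more, smaller partitions.
df = spark.read.parquet("s3://my-bucket/input")    # hypothetical input path
df = df.repartition(2048)                          # going from 1024 to 2048 roughly halves the data per task

# Alternatively, raise the shuffle parallelism for all DataFrame shuffles:
spark.conf.set("spark.sql.shuffle.partitions", "2048")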

AWS has published this as a known issue:

For EMR:

For Glue jobs:

I ran into the same problem. No answers :(

@clay It's just my guess: when the market price rises above your bid, the spot instances are reclaimed and the nodes are lost. So if you are running long jobs, don't use spot instances. I found a way to split the dataset into many small tasks, each running only about 5 minutes, saving a reduced result to S3 after each one; after all of them finish, I read the results back from S3 and do one more reduce. That way I avoid a single long-running job.

I'm also hitting this problem :/ A similar case here (but with a big self-join). It has been going on for a while now. The logs on the resource manager only say the container was lost, with no indication of why. Memory could be an issue. Can you share the node's logs? Does unpersisting unused DataFrames help?

In that case you could try the following: are you on an EMR or a Cloudera stack? Also check whether the YARN scheduler's resource management is Fair or Capacity, and then try static memory allocation instead of dynamic, by passing in the number of executors and so on.

I use EMR, and after calling unpersist I also did not see any difference from the dynamic memory changes.

I'm asking you to use static memory allocation: turn dynamic allocation off and pass in the number of executors, executor memory and executor cores, computing them yourself instead of letting Spark allocate memory dynamically.

Based on your solution, the first option is to get dedicated core nodes instead of spot task nodes. The second option is basically to break the job into multiple jobs and run them manually in a progressive way?