Hadoop Spark - Container is running beyond physical memory limits

Tags: hadoop, apache-spark, spark-graphx

I have a cluster of two worker nodes. Worker Node 1 - 64 GB RAM, Worker Node 2 - 32 GB RAM.

Background summary: I am trying to execute spark-submit on a YARN cluster to run Pregel on a graph, calculate the shortest-path distances from one source vertex to all other vertices, and print the values on the console. Experiment:

  • For a small graph with 15 vertices, execution completes with application final status: SUCCEEDED
  • My code works fine and prints the shortest-distance graph for 241 vertices, taking a single vertex as the source vertex, but there is a problem
  • Problem: When I dig into the log files, the task completes successfully in 4 minutes 26 seconds, but the terminal keeps showing the application status as Running, and after roughly 12 more minutes the task execution is terminated with -

    Application application_1447669815913_0002 failed 2 times due to AM Container for appattempt_1447669815913_0002_000002 exited with exitCode: -104 For more detailed output, check application tracking page:http://myserver.com:8088/proxy/application_1447669815913_0002/
    Then, click on links to logs of each attempt. 
    Diagnostics: Container [pid=47384,containerID=container_1447669815913_0002_02_000001] is running beyond physical memory limits. Current usage: 17.9 GB of 17.5 GB physical memory used; 18.7 GB of 36.8 GB virtual memory used. Killing container.
    
    Dump of the process-tree for container_1447669815913_0002_02_000001 : 
     |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 47387 47384 47384 47384 (java) 100525 13746 20105633792 4682973 /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -Xmx16384m -Djava.io.tmpdir=/yarn/nm/usercache/cloudera/appcache/application_1447669815913_0002/container_1447669815913_0002_02_000001/tmp -Dspark.eventLog.enabled=true -Dspark.eventLog.dir=hdfs://myserver.com:8020/user/spark/applicationHistory -Dspark.executor.memory=14g -Dspark.shuffle.service.enabled=false -Dspark.yarn.executor.memoryOverhead=2048 -Dspark.yarn.historyServer.address=http://myserver.com:18088 -Dspark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.shuffle.service.port=7337 -Dspark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar -Dspark.serializer=org.apache.spark.serializer.KryoSerializer -Dspark.authenticate=false -Dspark.app.name=com.path.PathFinder -Dspark.master=yarn-cluster -Dspark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.path.PathFinder --jar file:/home/cloudera/Documents/Longest_Path_Data_1/Jars/ShortestPath_Loop-1.0.jar --arg /home/cloudera/workspace/Spark-Integration/LongestWorstPath/configFile --executor-memory 14336m --executor-cores 32 --num-executors 2
    |- 47384 47382 47384 47384 (bash) 2 0 17379328 853 /bin/bash -c LD_LIBRARY_PATH=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native::/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native /usr/lib/jvm/java-7-oracle-cloudera/bin/java -server -Xmx16384m -Djava.io.tmpdir=/yarn/nm/usercache/cloudera/appcache/application_1447669815913_0002/container_1447669815913_0002_02_000001/tmp '-Dspark.eventLog.enabled=true' '-Dspark.eventLog.dir=hdfs://myserver.com:8020/user/spark/applicationHistory' '-Dspark.executor.memory=14g' '-Dspark.shuffle.service.enabled=false' '-Dspark.yarn.executor.memoryOverhead=2048' '-Dspark.yarn.historyServer.address=http://myserver.com:18088' '-Dspark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' '-Dspark.shuffle.service.port=7337' '-Dspark.yarn.jar=local:/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/spark/lib/spark-assembly.jar' '-Dspark.serializer=org.apache.spark.serializer.KryoSerializer' '-Dspark.authenticate=false' '-Dspark.app.name=com.path.PathFinder' '-Dspark.master=yarn-cluster' '-Dspark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' '-Dspark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.4.7-1.cdh5.4.7.p0.3/lib/hadoop/lib/native' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.path.PathFinder' --jar file:/home/cloudera/Documents/Longest_Path_Data_1/Jars/ShortestPath_Loop-1.0.jar --arg '/home/cloudera/workspace/Spark-Integration/LongestWorstPath/configFile' --executor-memory 14336m --executor-cores 32 --num-executors 2 1> /var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001/stdout 2> /var/log/hadoop-yarn/container/application_1447669815913_0002/container_1447669815913_0002_02_000001/stderr
    Container killed on request. Exit code is 143
    Container exited with a non-zero exit code 143
    Failing this attempt. Failing the application.
    
    Things I have tried:

  • yarn.scheduler.maximum-allocation-mb – 32 GB
  • mapreduce.map.memory.mb = 2048 (previously 1024)
  • Tried varying --driver-memory up to 24g

  • Could you please elaborate on how I can configure the Resource Manager so that large graphs (> 300K vertices) can also be processed? Thank you.

    Spark jobs request resources from the Resource Manager in a different way from MapReduce jobs. Try tuning the number of executors and the memory/vcores assigned to each executor. Follow

    The more data you process, the more memory each Spark task needs. If your executors are running too many tasks, they can run out of memory. When I have had problems processing large amounts of data, it was usually because the number of cores per executor was not balanced properly. Try reducing the number of cores or increasing the executor memory.
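
    For illustration, a hedged spark-submit sketch of that advice, reusing the class name and file paths from the log above; the numbers (4 executors, 8 cores, 10g each) are only an example of trading cores for per-task memory, not values tuned for this cluster:

        # Sketch only: the log above shows 32 cores sharing a 14g executor heap;
        # fewer cores per executor leaves each task a larger share of the heap.
        spark-submit \
          --master yarn-cluster \
          --class com.path.PathFinder \
          --num-executors 4 \
          --executor-cores 8 \
          --executor-memory 10g \
          --conf spark.yarn.executor.memoryOverhead=2048 \
          /home/cloudera/Documents/Longest_Path_Data_1/Jars/ShortestPath_Loop-1.0.jar \
          /home/cloudera/workspace/Spark-Integration/LongestWorstPath/configFile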


    One easy way to spot memory problems is to check the Executors tab in the Spark UI. If you see a lot of red bars indicating long garbage-collection times, your executors are probably running out of memory.
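
    If the UI is not handy, a possible alternative (my own suggestion, not something the question's setup already does) is to append the standard HotSpot GC-logging flags to the spark-submit command so pause times show up in each executor's stderr:

        # Fragment to append to the spark-submit command; -verbose:gc and the
        # -XX:+PrintGC* flags are standard HotSpot options on Java 7.
        --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"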

    Simply increasing the default spark.driver.memory configuration from 512m to 2g solved the problem in my case.


    If it keeps hitting the same memory error, you can set it even higher. Then you can keep reducing it until it hits the same error again, so that you know the optimal driver memory to use for your job.
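
    A sketch of the same idea on the command line; 2g is only a starting point, and in yarn-cluster mode the driver memory has to be given at submit time, before the AM container is requested:

        # Sketch: raise the driver heap from the 512m default, then shrink it again
        # step by step until the error reappears, to find the smallest value that works.
        spark-submit --master yarn-cluster --class com.path.PathFinder \
          --driver-memory 2g \
          ShortestPath_Loop-1.0.jar configFile   # jar/arg paths shortened from the log above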

    I resolved the error in my case by increasing the spark.yarn.executor.memoryOverhead setting, which stands for off-heap memory.
    When you increase driver memory and executor memory, do not forget this configuration item.
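
    As a hedged sketch of where that setting goes (property names as spelled in Spark 1.x on YARN; the sizes are illustrative): YARN allocates the executor container at roughly spark.executor.memory plus spark.yarn.executor.memoryOverhead, so the overhead must leave headroom for off-heap use:

        # Sketch: container size is about executor-memory + memoryOverhead (MB), so raise
        # the overhead when off-heap allocations (network buffers, native libs) grow.
        spark-submit --master yarn-cluster --class com.path.PathFinder \
          --executor-memory 10g \
          --conf spark.yarn.executor.memoryOverhead=3072 \
          ShortestPath_Loop-1.0.jar configFile   # paths shortened from the log above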

    I had a similar problem:

    Key error information:

    • Exit code: -104
    • physical memory limits
    Increasing spark.executor.memory and spark.executor.memoryOverhead at the same time did not take effect.


    Then I increased spark.driver.memory, which solved the problem.
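
    For context: in yarn-cluster mode the driver runs inside the AM container, so an exit code -104 on the AM container points at driver-side memory rather than at the executors. A hedged sketch of the driver-side counterparts of the settings above (property name as in Spark 1.x; values illustrative):

        # Sketch: the AM container must hold the driver heap plus its overhead,
        # controlled here by --driver-memory and spark.yarn.driver.memoryOverhead.
        spark-submit --master yarn-cluster --class com.path.PathFinder \
          --driver-memory 4g \
          --conf spark.yarn.driver.memoryOverhead=1024 \
          ShortestPath_Loop-1.0.jar configFile   # paths shortened from the log above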

    There was a similar question earlier: @aditya did you find anything? The other one did not help me; you need to fine-tune your application to the capacity of your cluster. The parameters --driver-memory, --executor-memory, --executor-cores and --num-executors play a very important role when executing spark-submit on a cluster. Please take a look, I have the same problem. Does anyone know how I can work out which operation is eating the memory, whether it is some join or some cached data? Thanks. He says the container is running out of memory, not the executor.
    Application application_1577148289818_10686 failed 2 times due to AM Container for appattempt_1577148289818_10686_000002 exited with **exitCode: -104**
    
    Failing this attempt.Diagnostics: [2019-12-26 09:13:54.392]Container [pid=18968,containerID=container_e96_1577148289818_10686_02_000001] is running 132722688B beyond the **'PHYSICAL' memory limit**. Current usage: 1.6 GB of 1.5 GB physical memory used; 4.6 GB of 3.1 GB virtual memory used. Killing container.