Python Spark executor JVM crash
I have an EMR cluster with one master node and four worker nodes. Each node has 4 cores and 16 GB of RAM. I tried to run the following code to fit a logistic regression to my data:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
lr = LogisticRegression(labelCol="label", featuresCol="features")
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 1.0])
             .addGrid(lr.maxIter, [5, 10])
             .build())
# Create 2-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator(metricName="f1"), numFolds=2)
# Run cross validations
cvModel = cv.fit(age_training_data)
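For a sense of the workload this launches, each combination in the 2×2×2 grid above is fitted once per fold. A quick check of the arithmetic (plain Python, using only the values quoted above):

```python
# Number of LogisticRegression fits launched by the CrossValidator above
reg_param = [0.01, 2.0]      # lr.regParam values in the grid
elastic_net = [0.0, 1.0]     # lr.elasticNetParam values
max_iter = [5, 10]           # lr.maxIter values
num_folds = 2                # numFolds on the CrossValidator

grid_size = len(reg_param) * len(elastic_net) * len(max_iter)
total_fits = grid_size * num_folds
print(grid_size, total_fits)  # 8 parameter combinations, 16 model fits
```

So a single cv.fit call runs 16 separate training jobs (plus one final refit on the full data), which is why failures can keep appearing throughout a long training session.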
My spark-defaults.conf is:
spark.driver.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.executor.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.driver.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.hadoop.yarn.timeline-service.enabled false
spark.driver.memory 10g
spark.yarn.driver.memoryOverhead 5g
spark.driver.cores 3
spark.executor.memory 10g
spark.yarn.executor.memoryOverhead 2048m
spark.executor.instances 4
spark.executor.cores 2
spark.default.parallelism 48
spark.yarn.max.executor.failures 32
spark.network.timeout 10000000s
spark.rpc.askTimeout 10000000s
spark.executor.heartbeatInterval 10000000s
spark.yarn.historyServer.address emr-header-1.cluster-60683:18080
spark.ui.view.acls *
#spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -XX:+UseG1GC
#spark.kryoserializer.buffer.max 128m
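One thing worth checking with a configuration like this is how much of each 16 GB worker node the executor container actually claims. A back-of-the-envelope sketch (pure arithmetic, using only the values quoted above; YARN requests heap plus overhead per executor container):

```python
MB = 1
GB = 1024 * MB

executor_heap = 10 * GB          # spark.executor.memory 10g
executor_overhead = 2048 * MB    # spark.yarn.executor.memoryOverhead 2048m

# YARN sizes each executor container as heap + overhead
container_request = executor_heap + executor_overhead
print(container_request)  # 12288 MB = 12 GB per executor container

node_ram = 16 * GB
# With spark.executor.instances 4 across four workers, each node hosts
# one executor container; the remainder must cover the OS, the YARN
# NodeManager, and the pyspark.daemon Python workers, which all live
# outside the JVM heap.
headroom = node_ram - container_request
print(headroom)  # 4096 MB left per node
```

That leaves roughly 4 GB per node for everything outside the executor JVM. An undersized overhead typically shows up as YARN killing the container for exceeding its memory limit, whereas a SIGSEGV inside libjvm.so, as in the crash report below, points at a fault inside the JVM itself.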
Throughout the run, many executors were killed. The stderr of one of the failed containers was:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fe1926cb033, pid=21382, tid=0x00007fe1908db700
#
# JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x5aa033]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/hs_err_pid21382.log
[thread 140606771681024 also had an error]
[thread 140606768523008 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
The message shown in the pyspark shell while the code was running was:
WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1521314237048_0001_01_000005 on host: emr-worker-1.cluster-60683. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1521314237048_0001_01_000005
Exit code: 134
Exception message: /bin/bash: line 1: 21382 Aborted LD_LIBRARY_PATH=/usr/lib/hadoop-current/lib/native::/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native /usr/lib/jvm/java/bin/java -server -Xmx10240m '-XX:+UseG1GC' -Djava.io.tmpdir=/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/tmp '-Dspark.driver.port=33390' '-Dspark.rpc.askTimeout=10000000s' -Dspark.yarn.app.container.log.dir=/mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.16.9.204:33390 --executor-id 4 --hostname emr-worker-1.cluster-60683 --cores 2 --app-id application_1521314237048_0001 --user-class-path file:/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/__app__.jar > /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stdout 2> /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 21382 Aborted LD_LIBRARY_PATH=/usr/lib/hadoop-current/lib/native::/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native /usr/lib/jvm/java/bin/java -server -Xmx10240m '-XX:+UseG1GC' -Djava.io.tmpdir=/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/tmp '-Dspark.driver.port=33390' '-Dspark.rpc.askTimeout=10000000s' -Dspark.yarn.app.container.log.dir=/mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.16.9.204:33390 --executor-id 4 --hostname emr-worker-1.cluster-60683 --cores 2 --app-id application_1521314237048_0001 --user-class-path file:/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/__app__.jar > /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stdout 2> /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stderr
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 134
I would appreciate it if someone could tell me why the executors keep failing throughout the ML training. Many thanks.

"An error report file with more information is saved as [...]/hs_err_pid21382.log" — you should grab that file.

The hs_err_pid21382.log is deleted as soon as the executor is killed, so I am not able to view it. Is there any way to access it once the container has died?

If the directory gets cleaned up, you can set the JVM option -XX:ErrorFile to a persistent directory and wait for the crash to happen again.
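As a sketch of that suggestion (the /mnt/jvm-crash path is only an illustration, not from the original post), the option can be added to the executor JVM flags in spark-defaults.conf:

```
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:ErrorFile=/mnt/jvm-crash/hs_err_pid%p.log
```

The JVM expands %p to the crashing process id. Note that spark-defaults.conf keeps only one value per key, so any existing flags on spark.executor.extraJavaOptions (here -XX:+UseG1GC; the file above also has an earlier -Dlog4j.ignoreTCL=true entry that this line silently overrides) must be merged into the same line, and the target directory must already exist and be writable on every worker node.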