Python Spark executor JVM crash


I have an EMR cluster with one master node and four worker nodes. Each node has 4 cores and 16 GB of RAM. I am trying to run the following code to fit a logistic regression to my data:

from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LogisticRegression(labelCol="label", featuresCol="features")


paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 1.0])
             .addGrid(lr.maxIter, [5, 10])
             .build())

# Create 2-fold CrossValidator
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=MulticlassClassificationEvaluator(metricName="f1"),
                    numFolds=2)

# Run cross validations
cvModel = cv.fit(age_training_data)
My spark-defaults.conf is:

spark.driver.extraLibraryPath    /usr/lib/hadoop-current/lib/native
spark.executor.extraLibraryPath  /usr/lib/hadoop-current/lib/native
spark.driver.extraJavaOptions    -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions  -Dlog4j.ignoreTCL=true
spark.hadoop.yarn.timeline-service.enabled  false
spark.driver.memory                 10g
spark.yarn.driver.memoryOverhead    5g
spark.driver.cores                  3
spark.executor.memory               10g
spark.yarn.executor.memoryOverhead  2048m
spark.executor.instances             4
spark.executor.cores                 2
spark.default.parallelism           48
spark.yarn.max.executor.failures    32
spark.network.timeout               10000000s
spark.rpc.askTimeout                10000000s
spark.executor.heartbeatInterval    10000000s
spark.yarn.historyServer.address emr-header-1.cluster-60683:18080
spark.ui.view.acls *
#spark.serializer                    org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions     -XX:+UseG1GC
#spark.kryoserializer.buffer.max     128m
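One thing worth noting about the config above: spark.executor.extraJavaOptions is set twice (once with -Dlog4j.ignoreTCL=true, once with -XX:+UseG1GC), and the later entry silently wins; the launch command in the logs below confirms that only -XX:+UseG1GC made it onto the executor's java command line. A minimal sketch of a checker that flags duplicated keys in a spark-defaults.conf-style properties file (the `duplicate_keys` helper is illustrative, not part of Spark):

```python
# Detect duplicate keys in a spark-defaults.conf style properties file.
# Spark keeps only the last occurrence of a repeated key, so earlier
# settings (here, -Dlog4j.ignoreTCL=true) are silently dropped.
from collections import Counter

def duplicate_keys(conf_text):
    """Return keys that appear more than once, ignoring comments and blanks."""
    keys = []
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        keys.append(line.split(None, 1)[0])  # key is the first whitespace-delimited token
    return sorted(k for k, n in Counter(keys).items() if n > 1)

conf = """
spark.driver.extraJavaOptions    -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions  -Dlog4j.ignoreTCL=true
spark.executor.memory            10g
spark.executor.extraJavaOptions  -XX:+UseG1GC
"""

print(duplicate_keys(conf))
```

To keep both options, merge them onto one line: `spark.executor.extraJavaOptions -Dlog4j.ignoreTCL=true -XX:+UseG1GC`.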
Many executors get killed during the run. The stderr of one of the failed containers shows:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007fe1926cb033, pid=21382, tid=0x00007fe1908db700
#
# JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x5aa033]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/hs_err_pid21382.log
[thread 140606771681024 also had an error]
[thread 140606768523008 also had an error]
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
The message shown in the pyspark shell while the code runs is:

WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1521314237048_0001_01_000005 on host: emr-worker-1.cluster-60683. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1521314237048_0001_01_000005
Exit code: 134
Exception message: /bin/bash: line 1: 21382 Aborted                 LD_LIBRARY_PATH=/usr/lib/hadoop-current/lib/native::/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native /usr/lib/jvm/java/bin/java -server -Xmx10240m '-XX:+UseG1GC' -Djava.io.tmpdir=/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/tmp '-Dspark.driver.port=33390' '-Dspark.rpc.askTimeout=10000000s' -Dspark.yarn.app.container.log.dir=/mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.16.9.204:33390 --executor-id 4 --hostname emr-worker-1.cluster-60683 --cores 2 --app-id application_1521314237048_0001 --user-class-path file:/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/__app__.jar > /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stdout 2> /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 21382 Aborted                 LD_LIBRARY_PATH=/usr/lib/hadoop-current/lib/native::/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native /usr/lib/jvm/java/bin/java -server -Xmx10240m '-XX:+UseG1GC' -Djava.io.tmpdir=/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/tmp '-Dspark.driver.port=33390' '-Dspark.rpc.askTimeout=10000000s' -Dspark.yarn.app.container.log.dir=/mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.16.9.204:33390 --executor-id 4 --hostname emr-worker-1.cluster-60683 --cores 2 --app-id application_1521314237048_0001 --user-class-path file:/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/__app__.jar > /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stdout 2> /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stderr
    at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
    at org.apache.hadoop.util.Shell.run(Shell.java:456)
    at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
    at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
    at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 134

I would be very grateful if anyone could tell me why the executors fail during the ML training process. Many thanks.

"An error report file with more information is saved as […]/hs_err_pid21382.log" — you should grab that file.

hs_err_pid21382.log is deleted as soon as the executor is killed, so I cannot view it. Is there any way to access it after the container dies?

If the directory gets cleaned up, you can set the JVM option -XX:ErrorFile to a persistent directory and wait for the crash to happen again.
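Following that suggestion, a sketch of the corresponding spark-defaults.conf line, assuming /mnt/persist is a hypothetical directory that survives YARN's container cleanup (%p is expanded by the JVM to the crashing process's PID; the existing -XX:+UseG1GC option is kept on the same line since a repeated key would override it):

```
spark.executor.extraJavaOptions  -XX:+UseG1GC -XX:ErrorFile=/mnt/persist/hs_err_pid%p.log
```

With this in place, the hs_err log should remain readable after the container's appcache directory is deleted.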