Java Apache Spark and Spark JobServer crash after a few hours


I am using Apache Spark 2.0.2 and Spark JobServer 0.7.0.

I know this is not best practice, but it is a first step. My server has 52 GB of RAM and 6 CPU cores, CentOS 7 x64, Java(TM) SE Runtime Environment (build 1.7.0_79-b15), and it runs the following applications with the memory configurations listed:

  • JBoss AS 7 (6 GB)
  • PDI Pentaho 6.0 (12 GB)
  • MySQL (20 GB)
  • Apache Spark 2.0.2 (8 GB)
I start it and everything works as expected, for several hours. I have one jar containing the two jobs, which extend from my VIQ_SparkJob class (a sketch of what one of those jobs might look like follows the base class below):

public class VIQ_SparkJob extends JavaSparkJob {

    protected SparkSession sparkSession;
    protected String TENANT_ID;

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        // Build (or reuse) a SparkSession on top of the context handed over by the job server.
        sparkSession = SparkSession.builder()
                .sparkContext(jsc)
                .enableHiveSupport()
                .config("spark.sql.warehouse.dir", "file:///value_iq/spark-warehouse/")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryoserializer.buffer", "8m")
                .getOrCreate();
        // Register the classes that get serialized with Kryo.
        Class<?>[] classes = new Class<?>[2];
        classes[0] = UsersCube.class;
        classes[1] = ImportCSVFiles.class;
        sparkSession.sparkContext().conf().registerKryoClasses(classes);
        TENANT_ID = jobConfig.getString("tenant_id");
        return true;
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        return SparkJobValid$.MODULE$;
    }
}
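The two concrete jobs themselves are not shown in the post; a minimal sketch of what one of them might look like, assuming it simply reuses the session and tenant id set up by the base class (the class name, table and query below are made up for illustration):

import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import com.typesafe.config.Config;

// Hypothetical job: everything except the inheritance from VIQ_SparkJob is a placeholder.
public class UsersCubeJob extends VIQ_SparkJob {

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        // Let the parent create the SparkSession and read tenant_id from the job config.
        super.runJob(jsc, jobConfig);

        // Placeholder query against a made-up table, filtered by the current tenant.
        Dataset<Row> rows = sparkSession.sql(
                "SELECT * FROM users_cube WHERE tenant_id = '" + TENANT_ID + "'");

        // Return something small and serializable as the job result.
        return rows.count();
    }
}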
I have the master and one worker. My spark-defaults.conf:

spark.debug.maxToStringFields  256
spark.shuffle.service.enabled true
spark.shuffle.file.buffer 64k
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 5
spark.rdd.compress true
This is my Spark JobServer settings.sh:

DEPLOY_HOSTS="myserver.com"
APP_USER=root
APP_GROUP=root
JMX_PORT=5051
INSTALL_DIR=/bin/spark/job-server-master
LOG_DIR=/var/log/job-server-master
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=2G
SPARK_VERSION=2.0.2
MAX_DIRECT_MEMORY=2G
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.8
I create the context with:

curl -k --basic --user 'user:password' -d ''
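
The URL is cut off above; for reference, a context-creation call against the Spark JobServer REST API usually looks something like this (the host, port, context name and sizes here are placeholders, not the values actually used):

# hypothetical example only; the real URL and parameters were truncated in the post
curl -k --basic --user 'user:password' -d '' 'https://myserver.com:8090/contexts/my-context?num-cpu-cores=5&memory-per-node=8G'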

The Spark driver uses 2 GB.

The created application looks like this:

ExecutorID  Worker                                      Cores   Memory  State       Logs
0           worker-20170203084218-157.97.107.42-50199   5       8192    RUNNING     stdout stderr
These are my executors:

Executor ID     Address ▴               Status  RDD Blocks  Storage Memory      Disk Used   Cores   
driver          157.97.107.42:55222     Active  0           0.0 B / 1018.9 MB   0.0 B       0 
0               157.97.107.42:55223     Active  0           0.0 B / 4.1 GB      0.0 B       5 
I have a process that checks the memory used by each of these processes; the highest amount it has recorded is 8468 MB.

There are 4 Spark-related processes:

  • The master process. It starts with 1 GB of memory allocated; I don't know where that configuration comes from, but it seems to be enough. top shows it using only 0.4 GB.
  • The worker process. Same memory usage as the master.
  • The driver process, which is configured with 2 GB.
  • The context, which is configured with 8 GB.
In the table below you can see how the memory used by the driver and the context behaves. After a java.lang.OutOfMemoryError: Java heap space, the context fails, but the driver accepts another context, so it keeps working.

system_user   | RAM(Mb)  |  entry_date
--------------+----------+---------------------
spark.driver    2472.11     2017-02-07 10:10:18 //Till here everything was fine
spark.context   5470.19     2017-02-07 10:10:18 //it was running for more than 48 hours

spark.driver    2472.11     2017-02-07 10:11:18 //Then I execute three big concurrent queries
spark.context   0.00        2017-02-07 10:11:18 //and I get java.lang.OutOfMemoryError: Java heap space
                                                //in $LOG_FOLDER/job-server-master/server_startup.log

# I've checked and the context was still present in the jobserver, but unresponsive.
# In Spark the application was killed.


spark.driver    2472.11     2017-02-07 10:16:18 //Here I have deleted the context and created it again
spark.context   105.20      2017-02-07 10:16:18

spark.driver    2577.30     2017-02-07 10:19:18 //Here I execute the three big 
spark.context   3734.46     2017-02-07 10:19:18 //concurrent queries again.

spark.driver    2577.30     2017-02-07 10:20:18 //Here, after the queries were
spark.context   5154.60     2017-02-07 10:20:18 //executed. No memory issue.
I have two questions:

1 - Why, when I check the Spark GUI, does the driver that is configured with 2 GB use only 1 GB, and likewise executor 0, which only uses 4.4 GB? Where is the rest of the configured memory? Yet the driver process, seen at the OS level, uses 2 GB.
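
(For reference, Spark's unified memory model may account for these numbers, although this is an assumption and not something confirmed in this post: the "Storage Memory" column in the UI only shows the unified storage/execution pool, roughly (usable heap - 300 MB reserved) × spark.memory.fraction, which defaults to 0.6 in Spark 2.x. The rest of the heap is still allocated to the JVM; the UI just does not count it.)

(2048 MB - 300 MB) × 0.6 ≈ 1.0 GB   //roughly the ~1019 MB shown for the driver
(8192 MB - 300 MB) × 0.6 ≈ 4.6 GB   //close to the 4.1 GB shown for the executor; the JVM reports somewhat less usable heap than -Xmx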


2 - If there is enough memory on the server, why am I running out of memory?

When no job is running, can you check the Spark UI and work out how much memory is actually available? You said you allocated 8 GB to Spark, but you should keep some memory aside for Spark itself. Also, you have a lot of other processes running, yet you are targeting 5 cores for the executor.

I have monitored the system; the peak memory used by all the applications together is about 42 GB, so there are roughly 10 GB free. I have checked the Spark UI: the worker is there, but the application's state is "KILLED", so I cannot see its memory status. Do you see any missing configuration, or a better way to work out the current memory?

One doubt about the memory: if I give the jobserver 2 GB of memory, that will be the driver memory, right? And when I start a context with, say, 8 GB, are those 8 GB split into 6 GB for the executors and 2 GB for the driver? I am monitoring every Spark process, but I still don't understand how the memory allocation works. At the end of the question I have added how the driver and context memory behaved before and after the crash.
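
On that last doubt, a rough mapping of which setting feeds which JVM, under the assumption that in this 0.7.0 setup the driver runs inside the job server process (the values below are the ones from this post, but the mapping itself is an assumption, not something confirmed in the thread):

JOBSERVER_MEMORY=2G     # settings.sh -> heap of the job server JVM, which also hosts the driver here (assumption)
memory-per-node=8G      # context-creation parameter -> spark.executor.memory for each executor (assumption)

If that mapping holds, the 8 GB of the context is not split 6/2; it all goes to the executors, while the driver stays at the 2 GB given to the job server.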