Java Apache Spark and Spark JobServer crash after a few hours


I am using Apache Spark 2.0.2 and Spark JobServer 0.7.0.

I know this is not best practice, but it is a first step. My server has 52 GB of RAM and 6 CPU cores, CentOS 7 x64, Java(TM) SE Runtime Environment (build 1.7.0_79-b15), and it runs the following applications with the memory configurations listed:

  • JBoss AS 7 (6 GB)
  • PDI Pentaho 6.0 (12 GB)
  • MySQL (20 GB)
  • Apache Spark 2.0.2 (8 GB)
I start it and everything works as expected, for several hours. I have one jar containing the two jobs, which extend from my VIQ_SparkJob class (a sketch of what one of those jobs might look like follows the base class below):

public class VIQ_SparkJob extends JavaSparkJob {

    protected SparkSession sparkSession;
    protected String TENANT_ID;

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        // Build (or reuse) a SparkSession on top of the context handed over by the job server.
        sparkSession = SparkSession.builder()
                .sparkContext(jsc)
                .enableHiveSupport()
                .config("spark.sql.warehouse.dir", "file:///value_iq/spark-warehouse/")
                .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
                .config("spark.kryoserializer.buffer", "8m")
                .getOrCreate();
        // Register the classes that get serialized with Kryo.
        Class<?>[] classes = new Class<?>[2];
        classes[0] = UsersCube.class;
        classes[1] = ImportCSVFiles.class;
        sparkSession.sparkContext().conf().registerKryoClasses(classes);
        TENANT_ID = jobConfig.getString("tenant_id");
        return true;
    }

    @Override
    public SparkJobValidation validate(SparkContext sc, Config config) {
        return SparkJobValid$.MODULE$;
    }
}
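The two concrete jobs themselves are not shown in the post; a minimal sketch of what one of them might look like, assuming it simply reuses the session and tenant id set up by the base class (the class name, table and query below are made up for illustration):

import org.apache.spark.SparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import com.typesafe.config.Config;

// Hypothetical job: everything except the inheritance from VIQ_SparkJob is a placeholder.
public class UsersCubeJob extends VIQ_SparkJob {

    @Override
    public Object runJob(SparkContext jsc, Config jobConfig) {
        // Let the parent create the SparkSession and read tenant_id from the job config.
        super.runJob(jsc, jobConfig);

        // Placeholder query against a made-up table, filtered by the current tenant.
        Dataset<Row> rows = sparkSession.sql(
                "SELECT * FROM users_cube WHERE tenant_id = '" + TENANT_ID + "'");

        // Return something small and serializable as the job result.
        return rows.count();
    }
}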
I have the master and one worker. My spark-defaults.conf:

spark.debug.maxToStringFields  256
spark.shuffle.service.enabled true
spark.shuffle.file.buffer 64k
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 5
spark.rdd.compress true
This is my Spark JobServer settings.sh:

DEPLOY_HOSTS="myserver.com"
APP_USER=root
APP_GROUP=root
JMX_PORT=5051
INSTALL_DIR=/bin/spark/job-server-master
LOG_DIR=/var/log/job-server-master
PIDFILE=spark-jobserver.pid
JOBSERVER_MEMORY=2G
SPARK_VERSION=2.0.2
MAX_DIRECT_MEMORY=2G
SPARK_CONF_DIR=$SPARK_HOME/conf
SCALA_VERSION=2.11.8
I create the context with:

curl -k --basic --user 'user:password' -d ''
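
The URL is cut off above; for reference, a context-creation call against the Spark JobServer REST API usually looks something like this (the host, port, context name and sizes here are placeholders, not the values actually used):

# hypothetical example only; the real URL and parameters were truncated in the post
curl -k --basic --user 'user:password' -d '' 'https://myserver.com:8090/contexts/my-context?num-cpu-cores=5&memory-per-node=8G'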

The Spark driver uses 2 GB.

The created application looks like this:

ExecutorID  Worker                                      Cores   Memory  State       Logs
0           worker-20170203084218-157.97.107.42-50199   5       8192    RUNNING     stdout stderr
These are my executors:

Executor ID     Address ▴               Status  RDD Blocks  Storage Memory      Disk Used   Cores   
driver          157.97.107.42:55222     Active  0           0.0 B / 1018.9 MB   0.0 B       0 
0               157.97.107.42:55223     Active  0           0.0 B / 4.1 GB      0.0 B       5 
I have a process that checks the memory used by each of these processes; the highest amount it has recorded is 8468 MB.

There are 4 Spark-related processes:

  • The master process. It starts with 1 GB of memory allocated; I don't know where that configuration comes from, but it seems to be enough. top shows it using only 0.4 GB.
  • The worker process. Same memory usage as the master.
  • The driver process, which is configured with 2 GB.
  • The context, which is configured with 8 GB.
In the table below you can see how the memory used by the driver and the context behaves. After a java.lang.OutOfMemoryError: Java heap space, the context fails, but the driver accepts another context, so it keeps working.

system_user   | RAM(Mb)  |  entry_date
--------------+----------+---------------------
spark.driver    2472.11     2017-02-07 10:10:18 //Till here everything was fine
spark.context   5470.19     2017-02-07 10:10:18 //it was running for more than 48 hours

spark.driver    2472.11     2017-02-07 10:11:18 //Then I execute three big concurrent queries
spark.context   0.00        2017-02-07 10:11:18 //and I get java.lang.OutOfMemoryError: Java heap space
                                                //in $LOG_FOLDER/job-server-master/server_startup.log

# I've checked and the context was still present in the jobserver, but unresponsive.
# In Spark the application was killed.


spark.driver    2472.11     2017-02-07 10:16:18 //Here I have deleted the context and created it again
spark.context   105.20      2017-02-07 10:16:18

spark.driver    2577.30     2017-02-07 10:19:18 //Here I execute the three big 
spark.context   3734.46     2017-02-07 10:19:18 //concurrent queries again.

spark.driver    2577.30     2017-02-07 10:20:18 //Here, after the queries were
spark.context   5154.60     2017-02-07 10:20:18 //executed. No memory issue.
I have two questions:

1 - Why, when I check the Spark GUI, does the driver that is configured with 2 GB use only 1 GB, and likewise executor 0, which only uses 4.4 GB? Where is the rest of the configured memory? Yet the driver process, seen at the OS level, uses 2 GB.
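
(For reference, Spark's unified memory model may account for these numbers, although this is an assumption and not something confirmed in this post: the "Storage Memory" column in the UI only shows the unified storage/execution pool, roughly (usable heap - 300 MB reserved) × spark.memory.fraction, which defaults to 0.6 in Spark 2.x. The rest of the heap is still allocated to the JVM; the UI just does not count it.)

(2048 MB - 300 MB) × 0.6 ≈ 1.0 GB   //roughly the ~1019 MB shown for the driver
(8192 MB - 300 MB) × 0.6 ≈ 4.6 GB   //close to the 4.1 GB shown for the executor; the JVM reports somewhat less usable heap than -Xmx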


2 - If there is enough memory on the server, why am I running out of memory?

When no job is running, can you check the Spark UI and work out how much memory is actually available? You said you allocated 8 GB to Spark, but you should keep some memory aside for Spark itself. Also, you have a lot of other processes running, yet you are targeting 5 cores for the executor.

I have monitored the system; the peak memory used by all the applications together is about 42 GB, so there are roughly 10 GB free. I have checked the Spark UI: the worker is there, but the application's state is "KILLED", so I cannot see its memory status. Do you see any missing configuration, or a better way to work out the current memory?

One doubt about the memory: if I give the jobserver 2 GB of memory, that will be the driver memory, right? And when I start a context with, say, 8 GB, are those 8 GB split into 6 GB for the executors and 2 GB for the driver? I am monitoring every Spark process, but I still don't understand how the memory allocation works. At the end of the question I have added how the driver and context memory behaved before and after the crash.
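
On that last doubt, a rough mapping of which setting feeds which JVM, under the assumption that in this 0.7.0 setup the driver runs inside the job server process (the values below are the ones from this post, but the mapping itself is an assumption, not something confirmed in the thread):

JOBSERVER_MEMORY=2G     # settings.sh -> heap of the job server JVM, which also hosts the driver here (assumption)
memory-per-node=8G      # context-creation parameter -> spark.executor.memory for each executor (assumption)

If that mapping holds, the 8 GB of the context is not split 6/2; it all goes to the executors, while the driver stays at the 2 GB given to the job server.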