SparkSQL job fails when calling stddev on more than 1,000 columns
I'm using Databricks with Spark 2.2.1 and Scala 2.11. I'm trying to run a SQL query like the following:
select stddev(col1), stddev(col2), ..., stddev(col1300)
from mydb.mytable
Then I execute it as follows:
val myRdd = sqlContext.sql(sql)
However, the following exception is thrown:
Job aborted due to stage failure: Task 24 in stage 16.0 failed 4 times, most recent failure: Lost task 24.3 in stage 16.0 (TID 1946, 10.184.163.105, executor 3): org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection has grown past JVM limit of 0xFFFF
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */ return new SpecificMutableProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificMutableProjection extends org.apache.spark.sql.catalyst.expressions.codegen.BaseMutableProjection {
/* 006 */
/* 007 */ private Object[] references;
/* 008 */ private InternalRow mutableRow;
/* 009 */ private boolean evalExprIsNull;
/* 010 */ private boolean evalExprValue;
/* 011 */ private boolean evalExpr1IsNull;
/* 012 */ private boolean evalExpr1Value;
/* 013 */ private boolean evalExpr2IsNull;
/* 014 */ private boolean evalExpr2Value;
/* 015 */ private boolean evalExpr3IsNull;
/* 016 */ private boolean evalExpr3Value;
/* 017 */ private boolean evalExpr4IsNull;
/* 018 */ private boolean evalExpr4Value;
/* 019 */ private boolean evalExpr5IsNull;
/* 020 */ private boolean evalExpr5Value;
/* 021 */ private boolean evalExpr6IsNull;
The problem seems to be related to stddev, but the exception isn't much help. Any ideas? Is there some other way to compute the standard deviation that avoids this problem?
It turns out this is the same problem described elsewhere: Spark is unable to handle wide schemas / very large numbers of columns because the generated classes hit JVM limits (the 64KB class size cap and, as in the stack trace here, the 0xFFFF constant pool limit). But if that were the whole story, why do avg and percentile_approx work? There are a few options:
- Try disabling whole-stage code generation:
spark.conf.set("spark.sql.codegen.wholeStage", false)
- If that doesn't help, switch to RDDs (the approach adopted below):
- Assemble the columns into a single vector with VectorAssembler and reuse the approach above, or adapt an Aggregator-based method (you may need some extra tweaking in finish to convert ml.linalg.Vectors to mllib.linalg.Vectors); see the sketch after the code below.
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.sql.functions.col

val columns: Seq[String] = ???  // the 1,300 column names

// df is the source table, e.g. val df = spark.table("mydb.mytable")
val summary = df
  .select(columns map (col(_).cast("double")): _*)  // cast every column to double
  .rdd
  .map(row => Vectors.dense(columns.map(row.getAs[Double](_)).toArray))  // one dense vector per row
  .aggregate(new MultivariateOnlineSummarizer)(
    (agg, v) => agg.add(v),            // fold each row vector into the running summary
    (agg1, agg2) => agg1.merge(agg2))  // merge partial summaries across partitions
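
The aggregate returns a MultivariateOnlineSummarizer (assigned to summary above); the per-column standard deviations can then be read off as the square roots of its variances:

val stddevs = summary.variance.toArray.map(math.sqrt)  // sample standard deviation per column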
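
For the third option, here is a minimal sketch of the VectorAssembler route, assuming the same df and columns as above (the names assembler, summary2, and stddevs2 are illustrative, not from the original answer). VectorAssembler emits ml.linalg vectors, so they are converted back with Vectors.fromML before feeding the same summarizer:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.{Vector => MLVector}
import org.apache.spark.mllib.linalg.{Vectors => OldVectors}
import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer
import org.apache.spark.sql.functions.col

// pack all 1,300 columns into a single vector column to keep the projected schema narrow
val assembler = new VectorAssembler()
  .setInputCols(columns.toArray)
  .setOutputCol("features")

val summary2 = assembler
  .transform(df.select(columns map (col(_).cast("double")): _*))
  .select("features")
  .rdd
  .map(row => OldVectors.fromML(row.getAs[MLVector]("features")))  // ml -> mllib vector
  .aggregate(new MultivariateOnlineSummarizer)(
    (agg, v) => agg.add(v),
    (agg1, agg2) => agg1.merge(agg2))

val stddevs2 = summary2.variance.toArray.map(math.sqrt)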