Apache Spark: Very large tasks in Spark
I have a program that converts an Excel file into a Spark DataFrame and then writes it to our data lake in compressed ORC format. Note that I am restricted to the Spark 1.6.2 API.
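- The variable sq is a HiveContext
- The variable schema contains a small (25 KB) Spark StructType
- The variable excelData contains a Java List of Spark Rows, holding a small amount of data
The code is as follows: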
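import org.apache.spark.sql.SaveMode
// Build the DataFrame from the local list of Rows, then write it out as ORC.
val df = sq.createDataFrame(excelData, schema)
log.info("Writing Spark DataFrame as ORC file...")
df.write.mode(SaveMode.Overwrite).option("compression", "snappy").orc("myfile.orc")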
Here is my YARN log:
17/06/16 17:03:13 ERROR ApplicationMaster: User class threw exception: java.lang.OutOfMemoryError: Java heap space
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at scala.StringContext.standardInterpolator(StringContext.scala:123)
at scala.StringContext.s(StringContext.scala:90)
at org.apache.spark.sql.execution.QueryExecution.toString(QueryExecution.scala:70)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:52)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation.run(InsertIntoHadoopFsRelation.scala:106)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
at org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
at org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:132)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:130)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:130)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:55)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:55)
at org.apache.spark.sql.execution.datasources.ResolvedDataSource$.apply(ResolvedDataSource.scala:256)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:148)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:139)
at org.apache.spark.sql.DataFrameWriter.orc(DataFrameWriter.scala:346)
at preprocess.Run$.main(Run.scala:109)
at preprocess.Run.main(Run.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:559)
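What is happening here? My impression is that the serialized task is too large.

Why do you broadcast excelData?

I broadcast it because it looked too large to ship to every worker. Am I wrong?

Actually, a broadcast ships the entire dataset to every worker. That is why it should be used only when absolutely necessary. Otherwise, let Spark decide how to split the work across the cluster. On the other hand, it looks like you are simply converting an Excel file to an ORC file. Do you need Spark at all?

I have to use Spark; it is the only tool I am allowed to use.

@sweeeeet As long as you do not accept an answer, your question is still considered unanswered. Deleting the answer may reduce its visibility to other users.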
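
The advice above to let Spark split the work itself, instead of broadcasting excelData, could look roughly like the following. This is a minimal sketch against the Spark 1.6 API, not the code from the post: the object ExcelToOrcSketch, the helper toDataFrame and the numSlices parameter are hypothetical names introduced here for illustration. Whether it removes the OutOfMemoryError is also an assumption: the trace shows the heap running out while QueryExecution.toString builds a string, which suggests the plan description itself is very large, but the post does not confirm the cause.

```scala
import java.util.{List => JList}
import scala.collection.JavaConverters._

import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.StructType

// Hypothetical helper object, not part of the original post.
object ExcelToOrcSketch {

  // HiveContext extends SQLContext, so the question's `sq` can be passed as-is.
  def toDataFrame(sq: SQLContext,
                  excelData: JList[Row],
                  schema: StructType,
                  numSlices: Int): DataFrame = {
    // Copy the driver-side Java list into a Scala List and distribute it as an
    // RDD with `numSlices` partitions, rather than broadcasting the whole list.
    val rowRdd = sq.sparkContext.parallelize(excelData.asScala.toList, numSlices)
    // Apply the existing schema to the RDD of Rows.
    sq.createDataFrame(rowRdd, schema)
  }
}
```

With this helper, the df.write call from the question could stay unchanged, for example after val df = ExcelToOrcSketch.toDataFrame(sq, excelData, schema, 4). The value 4 is arbitrary; for a small dataset a handful of partitions is usually enough.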