Scala Spark Standalone java.lang.OutOfMemoryError
I have a Spark standalone application installed on a server with 64 GB of RAM. The amount of data I want to process is far more than the RAM I have available. I am reading a lot of data into one big table and trying to do a self-join on it. The pseudocode looks like this:
val df = spark.read.parquet("huge_table.parquet")
val df2 = df.select(...).withColumn(...) // some data manipulations
df.as("df1").join(df2.as("df2"), $"df1.store_name" == $"df2.store_name" && $"df1.city_id" === $"df2.city_id")
My memory settings for the application look like this: --driver-memory 8g --executor-memory 32g
and spark-defaults.conf contains:
spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
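For reference, a minimal sketch of how the executor-side settings could alternatively be applied when building the SparkSession (the app name is a hypothetical placeholder); driver memory still has to be passed via --driver-memory, because the driver JVM is already running by the time this code executes:

import org.apache.spark.sql.SparkSession

// Executor settings must be set before the SparkSession (and its
// SparkContext) is first created.
val spark = SparkSession.builder()
  .appName("self-join-job")                        // hypothetical app name
  .config("spark.executor.memory", "32g")          // equivalent of --executor-memory 32g
  .config("spark.executor.extraJavaOptions",
    "-XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
  .getOrCreate()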
The problem is that, no matter what I do, I keep getting:
Caused by: java.lang.OutOfMemoryError: Java heap space
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:53)
at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:472)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:142)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.<init>(WindowExec.scala:310)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:290)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:289)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
To me this looks like a Cartesian join. Doesn't it?
@eliasah Not sure whether it is a Cartesian join, but the tables have different numbers of rows and there are several join conditions. Sometimes rows end up joined when those columns are null.
We can't help much with the amount of detail currently in the question, so I suggest trying spark.sql.shuffle.partitions set to 2001. Apache Spark uses a different data structure for shuffle bookkeeping when the number of partitions is greater than 2000. Try the repartitioning, as I said.
In fact this does look like a Cartesian join, and you will definitely need more RAM.
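A minimal sketch of the suggestions from the comments: raising spark.sql.shuffle.partitions above 2000 and repartitioning both sides on the join keys before the self-join. Here df2 is shown as a plain copy of df, because the original select/withColumn manipulations are elided in the post:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// More than 2000 shuffle partitions, so Spark switches to the more compact
// shuffle bookkeeping structure mentioned in the comments.
spark.conf.set("spark.sql.shuffle.partitions", "2001")

val df = spark.read.parquet("huge_table.parquet")
// Stand-in for the question's select(...).withColumn(...) manipulations (elided in the post).
val df2 = df

// Repartition both sides on the join keys so matching rows land in the same
// partitions and each task handles a smaller slice of the data.
val left  = df.repartition(2001, $"store_name", $"city_id").as("df1")
val right = df2.repartition(2001, $"store_name", $"city_id").as("df2")

val joined = left.join(right,
  $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id")

The sample data from the question is shown below for reference.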
+----+--------------+---------+
| ID | store_name | city_id |
+----+--------------+---------+
| 1 | Apple ... | 22 |
| 2 | Apple ... | 33 |
| 3 | Apple ... | 44 |
+----+--------------+---------+
+----+--------------+---------+---------+-------------+
| ID | store_name | city_id | sale_id | sale_amount |
+----+--------------+---------+---------+-------------+
| 1 | Apple ... | 33 | 1 | $30 |
| 2 | Apple ... | 44 | 2 | $50 |
| 3 | Apple ... | 44 | 3 | $50 |
| 4 | Apple ... | 44 | 4 | $50 |
| 5 | Apple ... | 44 | 5 | $40 |
+----+--------------+---------+---------+-------------+
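To make the row multiplication concrete, here is a small self-contained sketch (illustrative values only, loosely based on the tables above) that reproduces the shape of the join locally; every duplicate (store_name, city_id) pair on one side multiplies the matching rows on the other, which is what makes the full self-join explode:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("join-shape-demo").getOrCreate()
import spark.implicits._

// Illustrative stand-ins for the two tables shown above (values are made up).
val stores = Seq(
  (1, "Apple", 22),
  (2, "Apple", 33),
  (3, "Apple", 44)
).toDF("id", "store_name", "city_id")

val sales = Seq(
  (1, "Apple", 33, 1, 30),
  (2, "Apple", 44, 2, 50),
  (3, "Apple", 44, 3, 50),
  (4, "Apple", 44, 4, 50),
  (5, "Apple", 44, 5, 40)
).toDF("id", "store_name", "city_id", "sale_id", "sale_amount")

// The single store row with city_id 44 matches three sales rows, so the
// output is already larger than either input; with the full data set and
// skewed keys this multiplication is what exhausts the executor heap.
val joined = stores.as("df1").join(sales.as("df2"),
  $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id")

joined.show()  // 4 rows: 1 match for city 33, 3 matches for city 44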