
Scala Spark Standalone java.lang.OutOfMemoryError


I have a Spark standalone application running on a 64 GB RAM server. The amount of data I want to process far exceeds the amount of RAM available.

I am reading a large amount of data into one big table and trying to do a self-join on it. The pseudocode looks like this:

val df = spark.read.parquet("huge_table.parquet")
val df2 = df.select(...).withColumn(...) // some data manipulations
df.as("df1").join(df2.as("df2"), $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id")
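One way to keep a self-join like this from building huge in-memory sort runs is to pre-partition both sides on the join keys, so matching rows land in the same shuffle partition and spills stay small. This is only a sketch under that assumption, not the poster's actual code, and the partition count is illustrative:

```scala
// Sketch: repartition both sides on the join keys before joining.
// The sort-merge join then sorts and spills per partition instead
// of accumulating large runs per executor.
val left  = df.repartition(2001, $"store_name", $"city_id")
val right = df2.repartition(2001, $"store_name", $"city_id")

val joined = left.as("df1").join(
  right.as("df2"),
  $"df1.store_name" === $"df2.store_name" && $"df1.city_id" === $"df2.city_id"
)
```

`Dataset.repartition(numPartitions, cols*)` hash-partitions by the given columns, which is the same distribution the shuffle for an equi-join would use, so the extra shuffle is not wasted work.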
My executor settings look like this: --driver-memory 8g --executor-memory 32g

spark-defaults.conf:

spark.driver.extraJavaOptions -XX:+UseG1GC
spark.executor.extraJavaOptions  -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
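For comparison, a variant of spark-defaults.conf that also raises the shuffle partition count (as suggested in the comments below) might look like this. The added values are assumptions for illustration, not settings taken from the question:

```
spark.driver.extraJavaOptions    -XX:+UseG1GC
spark.executor.extraJavaOptions  -XX:+UseG1GC -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
spark.sql.shuffle.partitions     2001
```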
The problem is that no matter what I do, I get:

Caused by: java.lang.OutOfMemoryError: Java heap space
  at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillReader.<init>(UnsafeSorterSpillReader.java:53)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeSorterSpillWriter.getReader(UnsafeSorterSpillWriter.java:150)
  at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.getSortedIterator(UnsafeExternalSorter.java:472)
  at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:142)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.fetchNextRow(WindowExec.scala:301)
  at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.<init>(WindowExec.scala:310)
  at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:290)
  at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14.apply(WindowExec.scala:289)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)


To me this looks like a cartesian join. Isn't it?

@eliasah Not sure whether it is a cartesian join, but the tables have different row counts and there are multiple join conditions. Sometimes null values in the keys end up producing joined rows.

We can't help much without more details on the actual problem, so I'd suggest trying to set spark.sql.shuffle.partitions to 2001. Apache Spark uses a different data structure for shuffle bookkeeping when the number of partitions is greater than 2000. Try repartitioning, as I said.

Indeed, this is a cartesian join, and you definitely need more RAM.
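The advice in the comments can be sketched as follows; the `spark` session and the `cleaned` name are assumptions for illustration:

```scala
// Above 2000 shuffle partitions Spark switches to
// HighlyCompressedMapStatus for shuffle bookkeeping, which is
// far cheaper per map task on the driver.
spark.conf.set("spark.sql.shuffle.partitions", "2001")

// Rows with null join keys can never match under === but are
// still shuffled; dropping them up front shrinks the shuffle.
val cleaned = df.na.drop(Seq("store_name", "city_id"))
```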
+----+--------------+---------+
| ID |  store_name  | city_id |
+----+--------------+---------+
|  1 | Apple ...    |      22 |
|  2 | Apple ...    |      33 |
|  3 | Apple ...    |      44 |
+----+--------------+---------+
+----+--------------+---------+---------+-------------+
| ID |  store_name  | city_id | sale_id | sale_amount |
+----+--------------+---------+---------+-------------+
|  1 | Apple ...    |      33 |       1 | $30         |
|  2 | Apple ...    |      44 |       2 | $50         |
|  3 | Apple ...    |      44 |       3 | $50         |
|  4 | Apple ...    |      44 |       4 | $50         |
|  5 | Apple ...    |      44 |       5 | $40         |
+----+--------------+---------+---------+-------------+
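Sample rows like the above hint at key skew: the pair (Apple, 44) repeats, and every repeat on one side multiplies against every repeat on the other, so a few hot keys can blow up the join output. A quick diagnostic, assuming `df` from the question, is to count rows per join-key combination:

```scala
// Sketch: find the hottest join keys. Key pairs with very large
// counts are the ones that inflate the join toward cartesian size.
val keyCounts = df
  .groupBy($"store_name", $"city_id")
  .count()
  .orderBy($"count".desc)

keyCounts.show(10)
```

If a handful of keys dominate, salting those keys or filtering them for separate handling is usually cheaper than adding RAM.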