SparkOutOfMemoryError when loading data into an S3 bucket

scala apache-spark amazon-s3

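I have a DataFrame that I write to a target location in an S3 bucket. The code uses coalesce when loading the data, and it fails with a SparkOutOfMemoryError. coalesce is currently used in several of our projects. I have seen many solutions that suggest switching to repartition, and that does work for me. However, coalesce fails even when the DataFrame contains no records. Is there any other way to solve this without changing to repartition?

Code: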

empsql = 'Select * From Employee'
df = spark.sql(empsql) ##Spark is configured
df.coalesce(2).write.mode('overwrite').format("parquet").option("delimiter",'|').save(s3_path, header = True)

Error:

org.apache.spark.SparkException: Task failed while writing rows.
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply
    at org.apache.spark.scheduler.ResultTask.runTask
    at org.apache.spark.scheduler.Task.run
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply
    at org.apache.spark.util.Utils$.tryWithSafeFinally
    at org.apache.spark.executor.Executor$TaskRunner.run
    at java.util.concurrent.ThreadPoolExecutor.runWorker
    at java.util.concurrent.ThreadPoolExecutor$Worker.run
    at java.lang.Thread.run
Caused by: org.apache.spark.memory.SparkOutOfMemoryError: Unable to acquire 44 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.throwOom(MemoryConsumer.java:)
    at org.apache.spark.memory.MemoryConsumer.allocatePage
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:383)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:407)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:135)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.sort_addToSorter_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage29.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$11$$anon$1.hasNext(WholeStageCodegenExec.scala:619)
    at org.apache.spark.sql.execution.RowIteratorFromScala.advanceNext(RowIterator.scala:83)
    at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:89)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:100)
    at org.apache.spark.rdd.CoalescedRDD$$anonfun$compute$1.apply(CoalescedRDD.scala:99)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:435)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:441)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:241)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:239)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:245)

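Not sure whether this will work for you, but try coalescing with a shuffle:

df.repartition(2).write.mode('overwrite').format("parquet").option("delimiter",'|').save(s3_path, header = True)

Coalescing with shuffle=True adds a shuffle step, so the upstream partitions are computed in parallel instead of inside the two write tasks. The shuffle flag exists only on RDD.coalesce; DataFrame.coalesce accepts just the partition count, so on a DataFrame the equivalent call is repartition(2), as written above.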

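I solved my problem by adding a repartition(key) on an evenly distributed key before the coalesce().

I think this lets Spark send pre-sorted data to the writer nodes instead of sorting it on the writer executors.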
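The approach in this answer could be sketched as a small helper. This is only an illustration under assumptions: `write_with_pre_shuffle`, `key_col`, `shuffle_partitions` and `output_files` are hypothetical names that do not appear in the original post, and `df` is assumed to be a PySpark DataFrame. The stack trace in the question shows the sorter running inside CoalescedRDD.compute, which suggests that coalesce(2) pulled the upstream sort into the two write tasks; a repartition placed before it adds a shuffle boundary, so that work would be spread over more tasks.

```python
def write_with_pre_shuffle(df, key_col, s3_path, shuffle_partitions=200, output_files=2):
    """Shuffle by an evenly distributed key, then coalesce before writing.

    Assumes `df` is a pyspark.sql.DataFrame; every parameter name here is
    illustrative and not taken from the original post.
    """
    # repartition() introduces a shuffle, so upstream work (joins, sorts)
    # runs in `shuffle_partitions` tasks rather than inside the write tasks.
    shuffled = df.repartition(shuffle_partitions, key_col)
    # coalesce() then only merges the already shuffled partitions, without a
    # second shuffle, to limit the number of output files.
    merged = shuffled.coalesce(output_files)
    merged.write.mode("overwrite").format("parquet").save(s3_path)
```

With the question's DataFrame this might be called as write_with_pre_shuffle(df, "emp_id", s3_path), where "emp_id" stands in for whatever evenly distributed column the Employee table actually has.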

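Can we see the Stages tab of the Spark UI? It may give us more information. You can verify your executor configuration, and also check in the Spark UI whether any data skew could be causing this.

Does it work with a coalesce of more than 10, such as coalesce(11)? I don't think coalesce can increase the number of partitions beyond the actual number of partitions. Use the repartition function instead.
coalesce: reduces the number of partitions with minimal shuffling; does not create equal-sized partitions; works best when the difference between the actual and the required number of partitions is small; does not ignore the existing partitions.
repartition: increases or decreases the number of partitions; tends to create equal-sized partitions; makes the best use of parallelism (a boon for production releases); ignores the existing partitions and creates new ones.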
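The trade-off described in the last comment can be illustrated with a toy model in plain Python. It is not Spark's actual implementation: `toy_coalesce` and `toy_repartition` are hypothetical helpers that only mimic the observable behavior, assuming that coalesce merges whole existing partitions and never increases their number, while repartition redistributes individual rows roughly evenly.

```python
def toy_coalesce(partitions, n):
    """Merge whole partitions into at most n groups; no row is moved individually."""
    n = min(n, len(partitions))  # coalesce cannot increase the partition count
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Neighbouring input partitions end up in the same output partition.
        merged[i * n // len(partitions)].extend(part)
    return merged


def toy_repartition(partitions, n):
    """Redistribute every row round-robin into exactly n partitions (a full shuffle)."""
    out = [[] for _ in range(n)]
    rows = [row for part in partitions for row in part]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out


# Four skewed input partitions holding 12 rows in total.
skewed = [list(range(9)), [9], [10], [11]]

print([len(p) for p in toy_coalesce(skewed, 2)])     # [10, 2]: sizes stay skewed
print([len(p) for p in toy_repartition(skewed, 2)])  # [6, 6]: sizes are balanced
print(len(toy_coalesce(skewed, 11)))                 # 4: the count cannot grow
print(len(toy_repartition(skewed, 11)))              # 11
```

In this model the skew of the input survives coalescing, which is one reason the first comment asks about data skew in the Spark UI.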