Apache Spark: SparkException "Cannot broadcast the table that is larger than 8GB"


apache-spark


I am using Spark 2.2.0 for data processing. I join two DataFrames with Dataframe.join, but the job fails with the following stack trace:

18/03/29 11:27:06 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
18/03/29 11:27:09 ERROR FileFormatWriter: Aborting job null.
org.apache.spark.SparkException: Exception thrown in awaitResult:
    at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
    at org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:135)
    at org.apache.spark.sql.execution.SparkPlan.executeBroadcast(SparkPlan.scala:126)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.prepareBroadcast(BroadcastHashJoinExec.scala:98)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.codegenInner(BroadcastHashJoinExec.scala:197)
    at org.apache.spark.sql.execution.joins.BroadcastHashJoinExec.doConsume(BroadcastHashJoinExec.scala:82)
    at org.apache.spark.sql.execution.CodegenSupport$class.consume(WholeStageCodegenExec.scala:155)
    ...........
Caused by: org.apache.spark.SparkException: Cannot broadcast the table that is larger than 8GB: 10 GB
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:86)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
    at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:103)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
    at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Currently, Spark has a hard limit: a broadcast variable must be smaller than 8GB.

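8GB is generally large enough. Consider a job running with 100 executors: the Spark driver would have to send the 8GB of data to 100 nodes, resulting in roughly 800 GB of network traffic. If you do not broadcast and use a plain join instead, this cost is much lower.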
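The arithmetic in this answer can be sketched as below. This is only a back-of-the-envelope model, not a measurement: it assumes one full copy of the table per executor and ignores compression and how Spark actually distributes broadcast blocks. The function names and the 50 GB size of the other table are hypothetical, chosen purely for illustration.

```python
def broadcast_traffic_gb(table_gb, num_executors):
    """Traffic if every executor receives one full copy of the broadcast table."""
    return table_gb * num_executors


def shuffle_join_traffic_gb(left_gb, right_gb):
    """Rough upper bound for a shuffle-based join (assumption: each row of
    both inputs crosses the network at most once)."""
    return left_gb + right_gb


if __name__ == "__main__":
    # Numbers from the answer: an 8 GB table and 100 executors.
    print(broadcast_traffic_gb(8, 100))     # 800
    # Hypothetical: the same 8 GB table joined with a 50 GB table.
    print(shuffle_join_traffic_gb(8, 50))   # 58
```

Under these assumptions the broadcast cost grows with the number of executors, while the shuffle cost depends only on the sizes of the two inputs.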

After some reading, I tried disabling automatic broadcast joins, and that appears to have worked. I changed the Spark configuration as follows:

'spark.sql.autoBroadcastJoinThreshold': '-1'
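One possible way to apply this setting is sketched below, assuming a PySpark job. The helper names spark_submit_args and build_session and the application name are hypothetical; SparkSession.builder.config and the --conf flag of spark-submit are the standard ways to pass a configuration key.

```python
# The setting from the answer: -1 disables automatic broadcast joins.
SPARK_CONF = {"spark.sql.autoBroadcastJoinThreshold": "-1"}


def spark_submit_args(conf):
    """Turn a config dict into '--conf key=value' arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(conf, app_name="join-without-auto-broadcast"):
    """Create a SparkSession with the given settings (requires pyspark)."""
    from pyspark.sql import SparkSession  # imported lazily on purpose

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()


if __name__ == "__main__":
    print(" ".join(["spark-submit"] + spark_submit_args(SPARK_CONF)))
```

Note that this threshold only controls when Spark chooses a broadcast join on its own. A join that carries an explicit broadcast hint in the code may still be broadcast, so it is worth checking for such hints if the error persists.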

Please include your DataFrame creation and join code.

I am using 'spark.sql.autoBroadcastJoinThreshold': '-1' but still get the same error. What else should I try?

I am using the 'spark.sql.autoBroadcastJoinThreshold': '-1' configuration but still face the issue. I am using "unionByName" and no joins, so why do I get this error? Where can I check which join is causing it?