Exception in thread "broadcast-exchange-0" java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes
Tags: java, apache-spark, apache-spark-sql, apache-spark-2.0

I am running a Spark application with the following configuration: 1 master node and 2 worker nodes.
- Each worker has 88 cores, so the total core count is 176.
- Each worker has 502 GB of memory, so the total available memory is 1004 GB.

When running the application I get the following exception:
Exception in thread "broadcast-exchange-0" java.lang.OutOfMemoryError: Not enough memory to build and broadcast the table to all worker nodes. As a workaround, you can either disable broadcast by setting spark.sql.autoBroadcastJoinThreshold to -1 or increase the spark driver memory by setting spark.driver.memory to a higher value
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:115)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1$$anonfun$apply$1.apply(BroadcastExchangeExec.scala:73)
at org.apache.spark.sql.execution.SQLExecution$.withExecutionId(SQLExecution.scala:97)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec$$anonfun$relationFuture$1.apply(BroadcastExchangeExec.scala:72)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Two solutions are mentioned in the error itself. For reference, here is the code that builds the broadcast variables:
// 1. Collect the distinct currency codes onto the driver.
Dataset<Row> currencySet1 = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", CI_CURRENCY_CD).load();
currencySetCache = currencySet1.select(CURRENCY_CD, DECIMAL_POSITIONS).persist(StorageLevel.MEMORY_ONLY());
Dataset<Row> currencyCodes = currencySetCache.select(CURRENCY_CD);
currencySet = currencyCodes.as(Encoders.STRING()).collectAsList();

// 2. Collect the division codes onto the driver.
Dataset<Row> divisionSet = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", CI_CIS_DIVISION).load();
divisionSetCache = divisionSet.select(CIS_DIVISION).persist(StorageLevel.MEMORY_ONLY());
divisionList = divisionSetCache.as(Encoders.STRING()).collectAsList();

// 3. Collect the user IDs onto the driver.
Dataset<Row> userIdSet = sparkSession.read().format("jdbc").option("url", connection).option("dbtable", SC_USER).load();
userIdSetCache = userIdSet.select(USER_ID).persist(StorageLevel.MEMORY_ONLY());
userIdList = userIdSetCache.as(Encoders.STRING()).collectAsList();

// Broadcast each collected list to the executors.
ClassTag<List<String>> evidenceForDivision = scala.reflect.ClassTag$.MODULE$.apply(List.class);
Broadcast<List<String>> broadcastVarForDiv = context.broadcast(divisionList, evidenceForDivision);
ClassTag<List<String>> evidenceForCurrency = scala.reflect.ClassTag$.MODULE$.apply(List.class);
Broadcast<List<String>> broadcastVarForCurrency = context.broadcast(currencySet, evidenceForCurrency);
ClassTag<List<String>> evidenceForUserID = scala.reflect.ClassTag$.MODULE$.apply(List.class);
Broadcast<List<String>> broadcastVarForUserID = context.broadcast(userIdList, evidenceForUserID);

// Validation -- start
Encoder<RuleParamsBean> encoder = Encoders.bean(RuleParamsBean.class);
Dataset<RuleParamsBean> ds = new Dataset<>(sparkSession, finalJoined.logicalPlan(), encoder);
// The MapFunction cast disambiguates the Java overloads of Dataset.map.
Dataset<RuleParamsBean> validateDataset = ds.map(
        (MapFunction<RuleParamsBean, RuleParamsBean>) ruleParamsBean -> validateTransaction(
                ruleParamsBean,
                broadcastVarForDiv.value(),
                broadcastVarForCurrency.value(),
                broadcastVarForUserID.value()),
        encoder);
validateDataset.persist(StorageLevel.MEMORY_ONLY());
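Each collectAsList() above materializes an entire column on the driver before it is broadcast. As a rough back-of-envelope check (the per-entry constant below is an assumption, not Spark's exact accounting, and the class name is illustrative), you can estimate whether such a list even approaches the driver heap:

```java
public class BroadcastSizeEstimate {
    // Rough per-entry heap cost of a String held in a List: object headers,
    // references, and 2 bytes per UTF-16 char. The constant 48 is an
    // assumption; real overhead varies by JVM and by how Spark serializes
    // the broadcast payload.
    static long estimateBytes(long rows, int avgChars) {
        long perEntry = 48L + 2L * avgChars;
        return rows * perEntry;
    }

    public static void main(String[] args) {
        // e.g. one million 3-character currency codes
        long bytes = estimateBytes(1_000_000L, 3);
        System.out.println(bytes / (1024 * 1024) + " MB"); // ~51 MB
    }
}
```

Even when each list is modest, note that automatic broadcast joins build their broadcast relation on the driver as well, which is why the error message points at spark.driver.memory rather than at executor memory.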
Possible root cause: the default value of spark.driver.memory is only 1 GB (depending on the distribution), which is a very small number. If you read a lot of data onto the driver, an OutOfMemoryError can easily occur, and the suggestion in the exception is correct.
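A minimal sketch of the two workarounds named in the error message (a config fragment, assuming a standard SparkSession-based application; the app name is illustrative):

```java
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
        .appName("tx-validation") // illustrative name
        // Workaround 1: disable automatic broadcast joins entirely
        .config("spark.sql.autoBroadcastJoinThreshold", "-1")
        .getOrCreate();

// Workaround 2: raise driver (and executor) memory. In client mode,
// spark.driver.memory must be set before the driver JVM starts, i.e. via
// spark-submit or spark-defaults.conf rather than in code:
//   spark-submit --driver-memory 16g --executor-memory 16g ...
```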
Solution: increase spark.driver.memory and spark.executor.memory to at least 16 GB.

Comments:
- What do your broadcast variables contain, and how much memory do they consume?
- @tauitdnmd I added some code for reference that shows the broadcast variables. Essentially, each one holds the values of a single table column; there are 3 broadcast variables.
- What are the configured values of spark.driver.memory and spark.executor.memory?
- @pasha701 spark.driver.memory is at its default value (I did not set it explicitly), and --executor-memory=6G with --num-executors=22.
- Yes, I increased both parameters to 26 GB and the error was resolved. Thanks. I am trying to understand this better. You said "if you are reading a lot of data on the driver, OutOfMemory can happen", but the documentation says: "Does my data need to fit in memory to use Spark? No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level." So why did I have to increase my driver and executor memory? I am missing something here...
- I guess the passage you quoted is not related to broadcast variables.