Apache Spark WARN MemoryStore: not enough space

I am using Sparkling Water, reading data from Parquet files. The relevant part of my spark-defaults.conf:
`
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max 1g
spark.driver.memory 40g
spark.executor.memory 40g
spark.driver.maxResultSize 0
spark.python.worker.memory 30g
spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintTenuringDistribution
spark.storage.safetyFraction 0.9
spark.storage.memoryFraction 0.0
`
In practice, Spark only uses a fraction of the memory available to it, and there are many errors about failed memory allocation. Spark starts writing data to disk instead of keeping it in RAM. Why does this happen? Should I change something in the conf file? And how can I change the directory Java uses as "tmp"?
Thanks everyone!

> Spark starts writing data to disk instead of using RAM. Why does this happen?
This is most likely because somewhere your persistence is configured to use the MEMORY_AND_DISK option.
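The difference is visible in the storage levels' flags. Below is a minimal plain-Python mirror of Spark's `StorageLevel` flags, written for illustration only (it is not the `pyspark.StorageLevel` class itself):

```python
from collections import namedtuple

# Hypothetical stand-in for org.apache.spark.storage.StorageLevel's flags
StorageLevel = namedtuple("StorageLevel", "use_disk use_memory deserialized")

MEMORY_ONLY     = StorageLevel(use_disk=False, use_memory=True, deserialized=True)
MEMORY_AND_DISK = StorageLevel(use_disk=True,  use_memory=True, deserialized=True)

# With MEMORY_AND_DISK, partitions that do not fit in RAM are spilled to
# disk instead of being recomputed -- exactly the "writing to disk" the
# question describes. With MEMORY_ONLY they would be dropped and recomputed.
print(MEMORY_AND_DISK.use_disk)  # -> True
print(MEMORY_ONLY.use_disk)      # -> False
```

So if blocks end up on disk, the persistence call (or the API defaulting on your behalf) chose a disk-backed level.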
From the documentation -->
From the source code -->
And what about this part:
// Initial memory to request before unrolling any block
private val unrollMemoryThreshold: Long =
conf.get(STORAGE_UNROLL_MEMORY_THRESHOLD)
// Whether there is still enough memory for us to continue unrolling this block
var keepUnrolling = true
// Initial per-task memory to request for unrolling blocks (bytes).
val initialMemoryThreshold = unrollMemoryThreshold
// How often to check whether we need to request more memory
val memoryCheckPeriod = conf.get(UNROLL_MEMORY_CHECK_PERIOD)
// Memory currently reserved by this task for this particular unrolling operation
var memoryThreshold = initialMemoryThreshold
// Memory to request as a multiple of current vector size
val memoryGrowthFactor = conf.get(UNROLL_MEMORY_GROWTH_FACTOR)
// Keep track of unroll memory used by this particular block / putIterator() operation
var unrollMemoryUsedByThisBlock = 0L
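The fields above drive a grow-as-you-go loop: Spark requests an initial chunk of unroll memory, then every `memoryCheckPeriod` elements it checks whether the partially unrolled block has outgrown the current reservation and, if so, requests more, sized by `memoryGrowthFactor`. Here is a hedged Python sketch of that logic; the constants mirror Spark's documented defaults (1 MiB initial threshold, check every 16 elements, growth factor 1.5), while `free_memory` and the element sizes are made up for illustration:

```python
# Sketch of MemoryStore's unrolling loop; numbers are illustrative only.
INITIAL_MEMORY_THRESHOLD = 1024 * 1024  # default 1 MiB
MEMORY_CHECK_PERIOD = 16                # check every 16 elements
MEMORY_GROWTH_FACTOR = 1.5

def unroll(element_sizes, free_memory):
    """Return (kept_unrolling, reserved_bytes) after trying to unroll a block."""
    threshold = INITIAL_MEMORY_THRESHOLD
    # Request enough memory to begin unrolling
    if free_memory < threshold:
        return False, 0  # -> "Failed to reserve initial memory threshold ..."
    reserved = threshold
    used = 0
    for i, size in enumerate(element_sizes):
        used += size
        if i % MEMORY_CHECK_PERIOD == 0 and used >= threshold:
            # Request memory as a multiple of the current vector size
            needed = int(used * MEMORY_GROWTH_FACTOR) - reserved
            if free_memory - reserved < needed:
                return False, reserved  # ran out mid-unroll
            reserved += needed
            threshold = reserved
    return True, reserved

ok, reserved = unroll([512] * 10_000, free_memory=64 * 1024 * 1024)
```

With plenty of free memory the loop succeeds and ends up reserving slightly more than the block's true size; with almost none, it fails before unrolling a single element, which is the path that produces the warning below.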
Reading further down, this is where the warning you are seeing comes from:
// Request enough memory to begin unrolling
keepUnrolling =
reserveUnrollMemoryForThisTask(blockId, initialMemoryThreshold, memoryMode)
if (!keepUnrolling) {
logWarning(s"Failed to reserve initial memory threshold of " +
s"${Utils.bytesToString(initialMemoryThreshold)} for computing block $blockId in memory.")
} else {
unrollMemoryUsedByThisBlock += initialMemoryThreshold
}
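To put rough numbers on when that warning fires: under the unified memory manager (Spark 1.6+), the pool this reservation comes from is approximately (heap - 300 MB reserved) * `spark.memory.fraction`, of which `spark.memory.storageFraction` is protected for storage. A back-of-the-envelope sketch for the 40g executor from the question, assuming recent defaults (fraction 0.6, storageFraction 0.5):

```python
RESERVED_SYSTEM_MEMORY = 300 * 1024 * 1024  # fixed amount Spark sets aside
heap = 40 * 1024**3                         # spark.executor.memory 40g
memory_fraction = 0.6                       # spark.memory.fraction default
storage_fraction = 0.5                      # spark.memory.storageFraction default

usable = heap - RESERVED_SYSTEM_MEMORY
unified = usable * memory_fraction          # shared execution + storage pool
storage_floor = unified * storage_fraction  # storage share execution cannot evict

print(f"unified pool:  {unified / 1024**3:.1f} GiB")
print(f"storage floor: {storage_floor / 1024**3:.1f} GiB")
# The warning fires when even the initial 1 MiB unroll reservation cannot
# be satisfied from the storage side of this pool.
```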
So either enable OFF_HEAP at the application level, as done in this blog -->
or tune your cluster/machine configuration and enable this setting, as described here -->
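As a concrete example of the second option, off-heap storage is controlled by two settings in spark-defaults.conf (both exist in Spark 2.x+; the size value here is purely illustrative and must fit your machines):

`
spark.memory.offHeap.enabled true
spark.memory.offHeap.size 16g
`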
Finally, if none of the above helps: in my case, simply restarting the nodes made the warning go away. If you landed on this post still wondering what happened, the answer above explains how and why you get this error. Personally, I would only look closely at something like
(computed 3.2 MB so far)
in the logs, and then start worrying.
However, to solve it: set the
spark.storage.memoryFraction
flag to 1 when creating the SparkContext,
to use up to XX GB of memory; it defaults to 0.6 of the total memory provided.
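The arithmetic behind that advice (this applies to the legacy StaticMemoryManager, i.e. pre-unified memory management, which is where `spark.storage.memoryFraction` has effect): the storage pool is executor memory * `spark.storage.memoryFraction` * `spark.storage.safetyFraction`. Note that the conf in the question explicitly sets `spark.storage.memoryFraction` to 0.0, which under this scheme leaves essentially no storage pool at all:

```python
# Legacy (StaticMemoryManager) storage-pool arithmetic.
executor_memory = 40 * 1024**3  # spark.executor.memory 40g
safety_fraction = 0.9           # spark.storage.safetyFraction (question's conf)

def storage_pool(memory_fraction):
    """Bytes of the storage pool for a given spark.storage.memoryFraction."""
    return executor_memory * memory_fraction * safety_fraction

default_pool = storage_pool(0.6)  # default fraction -> 21.6 GiB
full_pool = storage_pool(1.0)     # fraction set to 1 -> 36.0 GiB
zero_pool = storage_pool(0.0)     # the question's conf -> 0 bytes
```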
Also consider setting
spark.rdd.compress
to true
and the
StorageLevel
to MEMORY_ONLY_SER
if your data is somewhat larger than the available memory. (You can also try MEMORY_AND_DISK_SER.)
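Why MEMORY_ONLY_SER buys headroom: serialized blocks are stored as one compact byte array instead of a graph of live objects. A rough stdlib illustration, with pickle standing in for Kryo (the exact ratio depends entirely on the data):

```python
import pickle
import sys

data = list(range(100_000))

# Footprint as live Python objects: container plus per-element object headers
object_bytes = sys.getsizeof(data) + sum(sys.getsizeof(x) for x in data)

# Footprint as a single serialized blob, as MEMORY_ONLY_SER would store it
serialized_bytes = len(pickle.dumps(data, protocol=pickle.HIGHEST_PROTOCOL))

print(object_bytes, serialized_bytes)
```

The trade-off is CPU: serialized blocks must be deserialized every time they are read.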
Just skimming through some old mailing lists, I stumbled upon this property:
**spark.shuffle.spill.numElementsForceSpillThreshold**
We set it with --conf spark.shuffle.spill.numElementsForceSpillThreshold=50000, which solved the issue, but the value needs to be tuned for the specific use case (try lowering it to 40000 or 30000).
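For reference, the same threshold can also be placed in spark-defaults.conf instead of on the command line (the 50000 value is the one from this answer; tune it per workload):

`
spark.shuffle.spill.numElementsForceSpillThreshold 50000
`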
As of now, Spark has two new parameters:
- spark.shuffle.spill.map.maxRecordsSizeForSpillThreshold
- spark.shuffle.spill.reduce.maxRecordsSizeForSpillThreshold
Reference:
Hope this helps! Cheers. Also see this: in one of our cases, we solved a similar issue by using --conf spark.shuffle.spill.numElementsForceSpillThreshold=50000, though for large shuffles: