Possible java.lang.OutOfMemoryError in the driver program: cannot save Word2Vec model


Using Spark v1.6.0

I am trying to train word vectors on our cluster. However, I get a
java.lang.OutOfMemoryError
exception (see the log output below). The job has 40 GB of RAM available and the input is a 5.5 GB JSON file.

Since all stages complete their tasks, I assume the problem comes down to

model.save(sc, outputFile)
but I am not 100% sure, and even if I were, I would not know how to avoid it. See the program skeleton below. I can provide the full source code if needed; I removed the preprocessing part to keep it short.

import java.io.File

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.Word2Vec

object Word2VecOnCluster {
  def main(args: Array[String]) {
    val inputFile = new File(args(0))
    val outputDirectory = new File(args(1))
    val conf = new SparkConf().setAppName("Word2VecOnCluster")
    if (args.length > 2) {
      conf.setMaster(args(2))
    }
    outputDirectory.mkdirs()
    val sc = new SparkContext(conf)
    val file = sc.textFile(inputFile.getAbsolutePath)
    val wordSequence = file
      .repartition(500)
      .mapPartitions(lineIterator => {
        // do some preprocessing ..
      })
      .map(line => line.split("\\s+").toSeq)
    val word2vec = new Word2Vec()
    val model = word2vec
      .setNumPartitions(500)
      .fit(wordSequence)
    val outputFile = outputDirectory.getAbsolutePath + File.separator + inputFile.getName
    model.save(sc, outputFile)
    System.exit(0)
  }
}
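For context on why saving or even fitting such a model can exhaust the driver heap: while fitting, MLlib's Word2Vec keeps Float arrays of size vocabSize * vectorSize on the driver. The following back-of-the-envelope sketch illustrates how fast that grows; the vocabulary size used here is a made-up assumption, not a number taken from the job above.

```scala
// Rough estimate of the driver-side weight matrices in MLlib's Word2Vec:
// two Array[Float] of length vocabSize * vectorSize (syn0Global, syn1Global).
object ModelSizeEstimate {
  def bytesForModel(vocabSize: Long, vectorSize: Int): Long =
    2L * vocabSize * vectorSize * 4L // two Float arrays, 4 bytes per element

  def main(args: Array[String]): Unit = {
    // Hypothetical vocabulary: 1 million words, default 100 dimensions
    val bytes = bytesForModel(1000000L, 100)
    println(f"~${bytes / 1024.0 / 1024.0}%.0f MB") // ~763 MB
  }
}
```

The important point is that the size scales with the vocabulary, not with the raw token count of the corpus.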
This is the log output after the last task of stage 2.0 finished:

16/04/06 12:06:33 INFO TaskSetManager: Finished task 428.0 in stage 2.0 (TID 970) in 701 ms on node12.hadoop.company.at (499/500)
16/04/06 12:06:33 INFO TaskSetManager: Finished task 439.0 in stage 2.0 (TID 981) in 680 ms on node12.hadoop.company.at (500/500)
16/04/06 12:06:33 INFO YarnScheduler: Removed TaskSet 2.0, whose tasks have all completed, from pool 
16/04/06 12:06:33 INFO DAGScheduler: ResultStage 2 (collect at Word2Vec.scala:170) finished in 2.356 s
16/04/06 12:06:33 INFO DAGScheduler: Job 0 finished: collect at Word2Vec.scala:170, took 439.713694 s
16/04/06 12:06:33 INFO Word2Vec: trainWordsCount = 685345162
16/04/06 12:06:33 INFO MemoryStore: Block broadcast_4 stored as values in memory (estimated size 3.9 KB, free 357.7 KB)
16/04/06 12:06:33 INFO MemoryStore: Block broadcast_4_piece0 stored as bytes in memory (estimated size 4.0 KB, free 361.7 KB)
16/04/06 12:06:33 INFO BlockManagerInfo: Added broadcast_4_piece0 in memory on 192.168.0.95:42051 (size: 4.0 KB, free: 511.1 MB)
16/04/06 12:06:33 INFO SparkContext: Created broadcast 4 from broadcast at Word2Vec.scala:290
16/04/06 12:06:33 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 150.6 MB, free 150.9 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 4.0 MB, free 154.9 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on 192.168.0.95:42051 (size: 4.0 MB, free: 507.1 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_5_piece1 stored as bytes in memory (estimated size 4.0 MB, free 158.9 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_5_piece1 in memory on 192.168.0.95:42051 (size: 4.0 MB, free: 503.1 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_5_piece2 stored as bytes in memory (estimated size 4.0 MB, free 162.9 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_5_piece2 in memory on 192.168.0.95:42051 (size: 4.0 MB, free: 499.1 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_5_piece3 stored as bytes in memory (estimated size 2.9 MB, free 165.8 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_5_piece3 in memory on 192.168.0.95:42051 (size: 2.9 MB, free: 496.2 MB)
16/04/06 12:06:35 INFO SparkContext: Created broadcast 5 from broadcast at Word2Vec.scala:291
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_6 stored as values in memory (estimated size 33.2 MB, free 199.0 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_6_piece0 stored as bytes in memory (estimated size 4.0 MB, free 203.0 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_6_piece0 in memory on 192.168.0.95:42051 (size: 4.0 MB, free: 492.2 MB)
16/04/06 12:06:35 INFO MemoryStore: Block broadcast_6_piece1 stored as bytes in memory (estimated size 870.5 KB, free 203.9 MB)
16/04/06 12:06:35 INFO BlockManagerInfo: Added broadcast_6_piece1 in memory on 192.168.0.95:42051 (size: 870.5 KB, free: 491.3 MB)
16/04/06 12:06:35 INFO SparkContext: Created broadcast 6 from broadcast at Word2Vec.scala:292
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:3236)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1877)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1786)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1189)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:742)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:741)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
    at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:741)
    at org.apache.spark.mllib.feature.Word2Vec$$anonfun$fit$1.apply$mcVI$sp(Word2Vec.scala:329)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:328)
    at masterthesis.code.wordvectors.Word2VecOnCluster$.main(Word2VecOnCluster.scala:112)
    at masterthesis.code.wordvectors.Word2VecOnCluster.main(Word2VecOnCluster.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.0.95:42051 in memory (size: 2.2 KB, free: 491.3 MB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node10.hadoop.company.at:36555 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node04.hadoop.company.at:33602 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node13.hadoop.company.at:53455 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node11.hadoop.company.at:50336 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node02.hadoop.company.at:52435 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node06.hadoop.company.at:49865 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node09.hadoop.company.at:44672 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node01.hadoop.company.at:33026 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node14.hadoop.company.at:38802 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node03.hadoop.company.at:48959 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node12.hadoop.company.at:60505 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node05.hadoop.company.at:43832 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node07.hadoop.company.at:56636 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node08.hadoop.company.at:51583 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_3_piece0 on node15.hadoop.company.at:41850 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO ContextCleaner: Cleaned accumulator 3
16/04/06 12:06:38 INFO ContextCleaner: Cleaned accumulator 2
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.0.95:42051 in memory (size: 2.2 KB, free: 491.3 MB)
16/04/06 12:06:38 INFO SparkContext: Invoking stop() from shutdown hook
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node14.hadoop.company.at:38802 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node05.hadoop.company.at:43832 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node12.hadoop.company.at:60505 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node08.hadoop.company.at:51583 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node03.hadoop.company.at:48959 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node15.hadoop.company.at:41850 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node07.hadoop.company.at:56636 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node13.hadoop.company.at:53455 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node06.hadoop.company.at:49865 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node10.hadoop.company.at:36555 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node04.hadoop.company.at:33602 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node02.hadoop.company.at:52435 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node09.hadoop.company.at:44672 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node11.hadoop.company.at:50336 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO BlockManagerInfo: Removed broadcast_1_piece0 on node01.hadoop.company.at:33026 in memory (size: 2.2 KB, free: 28.5 GB)
16/04/06 12:06:38 INFO ContextCleaner: Cleaned accumulator 1
16/04/06 12:06:38 INFO ContextCleaner: Cleaned shuffle 1
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/metrics/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/kill,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/api,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/static,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/threadDump,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/executors,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/environment,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/rdd,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/storage,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/pool,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/stage,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/stages,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/job,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs/json,null}
16/04/06 12:06:38 INFO ContextHandler: stopped o.s.j.s.ServletContextHandler{/jobs,null}
16/04/06 12:06:38 INFO SparkUI: Stopped Spark web UI at http://192.168.0.95:4040
16/04/06 12:06:38 INFO YarnClientSchedulerBackend: Shutting down all executors
16/04/06 12:06:38 INFO YarnClientSchedulerBackend: Interrupting monitor thread
16/04/06 12:06:38 INFO YarnClientSchedulerBackend: Asking each executor to shut down
16/04/06 12:06:38 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
(serviceOption=None,
 services=List(),
 started=false)
16/04/06 12:06:38 INFO YarnClientSchedulerBackend: Stopped
16/04/06 12:06:38 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/04/06 12:06:38 INFO MemoryStore: MemoryStore cleared
16/04/06 12:06:38 INFO BlockManager: BlockManager stopped
16/04/06 12:06:38 INFO BlockManagerMaster: BlockManagerMaster stopped
16/04/06 12:06:38 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/04/06 12:06:38 INFO SparkContext: Successfully stopped SparkContext
16/04/06 12:06:38 INFO ShutdownHookManager: Shutdown hook called
16/04/06 12:06:38 INFO ShutdownHookManager: Deleting directory /tmp/spark-3f6c9bf3-be08-4eac-bc12-f6110beedb60
16/04/06 12:06:38 INFO ShutdownHookManager: Deleting directory /tmp/spark-3f6c9bf3-be08-4eac-bc12-f6110beedb60/httpd-cc65e6a2-705b-4da0-8845-54992f85ddb8
16/04/06 12:06:38 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
As you can see, the stage finishes and the failure then points into the driver program. Please ignore the "(1 failed)"; that is just one peculiar node that sometimes refuses to work ^^


How can I solve this problem?

Looking at the line org.apache.spark.mllib.feature.Word2Vec.fit(Word2Vec.scala:328) in the stack trace, I would assume the error is caused by model.fit. Could it be that it retried the failed tasks afterwards? – NiekBartholomeus

@NiekBartholomeus No, I killed the job after that. I think the problem may be that the actual model is very big. There seem to be almost 700 million words in the end (
trainWordsCount = 685345162
), which would mean 700 million times 100 dimensions times 4 or even 8 bytes. I think that could indeed cause problems.

@NiekBartholomeus However, Google's word2vec model with 300 dimensions is, if I remember correctly, only about 3.5 GB in size.

If I am not mistaken, the word2vec algorithm does not need to keep the whole co-occurrence matrix in memory, so I would say your 40 GB should be enough for a 5 GB corpus. Did you specify enough memory for the Spark driver and/or executors? Can you monitor the RAM and verify that it is all used during training?

Did you ever solve this? We think we are facing a similar problem (although with less data and memory) and ended up using the ".ml." implementation, but in our case there were indeed some problems when saving the model (on the cluster). The problem in your case may really be in the fit step, since the code builds an array like
val syn1Global = new Array[Float](vocabSize * vectorSize)
, so an OOM error can occur there.
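If the driver heap is indeed the bottleneck, a commonly suggested first step is to give the driver (and, if needed, the executors) more memory when submitting the job. This is only a configuration sketch: the jar name, paths, and memory sizes below are placeholders, while the class name is taken from the stack trace above.

```shell
# Placeholders: adjust the jar path, input/output paths and sizes to your setup.
spark-submit \
  --class masterthesis.code.wordvectors.Word2VecOnCluster \
  --driver-memory 20g \
  --executor-memory 8g \
  word2vec-on-cluster.jar /path/to/input.json /path/to/output
```

Note that in client mode the driver JVM is already running by the time SparkConf is evaluated, so the driver heap should be set on the spark-submit command line rather than via conf.set in the program.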