Scala - Why does Spark throw a FetchFailed error?
I'm using Apache Zeppelin on Apache Mesos, with 4 nodes and 210 GB of total memory. My Spark job joins a small dataset of transactions against a large dataset of events. I want to match each transaction with the closest event, based on time and ID (event time vs. transaction time, event ID vs. transaction ID). I get the following error:
FetchFailed(null, shuffleId=1, mapId=-1, reduceId=20,
message=org.apache.spark.shuffle.MetadataFetchFailedException: Missing an output location for shuffle 1
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:542)
at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$2.apply(MapOutputTracker.scala:538)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:538)
at org.apache.spark.MapOutputTracker.getMapSizesByExecutorId(MapOutputTracker.scala:155)
at org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:47)
at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:98)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:140)
at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:136)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:136)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Here is my algorithm:
val groupRDD = event
  .map { evt => ((evt.id, evt.date_time.toString.dropRight(8)), cdr) }  // `cdr` is not defined in this excerpt; presumably the event record itself
  .groupByKey(new HashPartitioner(128))
  .persist(StorageLevel.MEMORY_AND_DISK_SER)

val joinedRDD = groupRDD.rightOuterJoin {
  transactions.keyBy { transac => (transac.id, transac.dateTime.toString.dropRight(8)) }
}

val result = joinedRDD.mapValues { case (a, b) =>
  val goodTransac = a.getOrElse(List(GeoLoc("", 0L, "", "", "", "", "")))
    .reduce((v1, v2) => minDelay(b.dateTime, v1, v2))
  SomeClass(b.id, b....., goodTransac.date_time, .....)
}
groupByKey should not be grouping too many elements (at most 50 per key).
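As an aside (not from the original post): the at-most-50-per-key assumption can be checked empirically by inspecting group sizes after the shuffle; a minimal sketch, assuming the `groupRDD` from the snippet above:

```scala
// Sketch: find the largest groups to rule out key skew.
// Assumes `groupRDD: RDD[((String, String), Iterable[Event])]` as built above.
val sizes = groupRDDmapValuesPlaceholder // see note below
```

A more concrete form of the same idea:

```scala
// Sketch: histogram of the 10 largest group sizes.
val top10 = groupRDD
  .mapValues(_.size)                       // group size per key
  .map { case (key, n) => (n, key) }
  .top(10)                                 // ordered by size, descending
top10.foreach { case (n, key) => println(s"key=$key size=$n") }
```

If any key shows up with thousands of elements, the skewed key (not total memory) is the likely cause of the failing fetch.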
I noticed that the error occurs when memory runs short, so I decided to persist serialized to both RAM and disk, and changed the serializer to Kryo. I also reduced spark.memory.storageFraction to 0.2, to leave more room for execution.
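The tuning described above can be expressed in the Spark configuration; a minimal sketch (the two values come from the text, the app name is hypothetical):

```scala
import org.apache.spark.SparkConf

// Sketch of the configuration described above; adjust to your cluster.
val conf = new SparkConf()
  .setAppName("transaction-event-join")  // hypothetical name, not from the post
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.memory.storageFraction", "0.2")  // leave more of the unified region for execution
```

Note that spark.memory.storageFraction only bounds the storage share of the unified memory region; execution can still borrow from storage, so lowering it mainly protects cached blocks from eviction rather than adding execution memory outright.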
When I check the web UI, I see GC taking more and more time during processing. When the job finally fails, GC takes 20 minutes out of a 22-minute runtime, but not on all workers.
I've already checked that, but my cluster still has plenty of RAM: about 90 GB free on Mesos.

What I would do first is check the number of partitions of the event RDD and of the one produced by groupByKey.
Using StorageLevel.MEMORY_AND_DISK_SER will require more IO, which slows the executors down, and the SER part can lead to longer GC pauses (after all, the datasets live in memory and have to be serialized, which roughly doubles the memory requirement). I'd strongly recommend not using MEMORY_AND_DISK_SER at this point.
I would also look at the dependency graph of the result RDD, to see how many shuffles there are and how many partitions are used in every stage:

result.toDebugString

There are quite a few places where things could go wrong.
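For reference, toDebugString prints the RDD lineage one dependency per line; a sketch of how to read it (the sample output below is illustrative, not actual output from this job):

```scala
// Sketch: print the lineage of `result`. Each indentation step under a
// "+-" marker crosses a shuffle boundary; the number in parentheses is
// that RDD's partition count.
println(result.toDebugString)
// Illustrative shape of the output:
// (128) MapPartitionsRDD[7] at mapValues ...
//  |    CoGroupedRDD[6] at rightOuterJoin ...
//  +-(128) ShuffledRDD[4] at groupByKey ...
```

Counting the ShuffledRDD / CoGroupedRDD lines tells you how many shuffles the job performs, and mismatched partition counts between stages are easy to spot here.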
P.S. Attaching screenshots of the Jobs, Stages, Storage and Executors pages from the web UI would greatly help narrow down the root cause.

How do you know this is due to long GC pauses? Are there any other errors? How do you submit the job, and what is the cluster type (i.e. standalone, YARN, or Mesos)?

I saw that GC sometimes takes 20 minutes out of a 22-minute runtime, which is too much. I submit through Zeppelin. I've already checked that link, but my cluster still has enough memory: I have 4 nodes with 210 GB in total, and 90 GB is still free on Mesos.

Can you include screenshots of the Jobs and Stages pages from the web UI?