
Scala Spark 1.5.2 shuffle/serialization - out of memory


I am working with a dataset of a few hundred GB (about 2B rows). One of the operations reduces an RDD of Scala case objects (containing doubles, maps, and sets) down to a single entity per key. Initially the operation used groupByKey, but it was slow with high GC. So I converted it to aggregateByKey, and later even to reduceByKey, hoping to avoid the high user-memory allocation, heavy shuffle activity, and high GC I ran into with groupBy.
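The move from groupByKey to reduceByKey/aggregateByKey matters because of how much intermediate state each strategy holds per key. The difference can be sketched with plain Scala collections, no Spark dependency needed (names here are illustrative, not from the job above):

```scala
// Plain-Scala sketch of the two aggregation strategies.
// groupByKey-style: materialize every value for a key, then reduce the list.
// reduceByKey-style: fold each value into one running accumulator per key.
object AggregationSketch {
  val pairs = Seq(("us", 1L), ("us", 2L), ("in", 3L), ("us", 4L))

  // groupByKey analogue: builds an intermediate Seq per key (high memory, high GC).
  def groupThenSum(data: Seq[(String, Long)]): Map[String, Long] =
    data.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // reduceByKey analogue: one running total per key, no intermediate Seq.
  def runningSum(data: Seq[(String, Long)]): Map[String, Long] =
    data.foldLeft(Map.empty[String, Long].withDefaultValue(0L)) {
      case (acc, (k, v)) => acc.updated(k, acc(k) + v)
    }

  def main(args: Array[String]): Unit = {
    // Both strategies produce the same result; only the memory profile differs.
    assert(groupThenSum(pairs) == runningSum(pairs))
    println(runningSum(pairs))
  }
}
```

In Spark the same contrast applies shuffle-side: reduceByKey and aggregateByKey combine map-side before shuffling, while groupByKey ships every value across the network.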

Application resources: 23GB executor memory + 4GB overhead, 20 instances with 6 cores each. Played with the shuffle memory fraction from 0.2 to 0.4.

Available cluster resources: 10 nodes, 600GB total in YARN, 32GB max container size.

2016-05-02 22:38:53,595 INFO [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 3 to hdn2.mycorp:45993
2016-05-02 22:38:53,832 INFO [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.storage.BlockManagerInfo: Removed broadcast_4_piece0 on 10.250.70.117:52328 in memory (size: 2.1 KB, free: 15.5 MB)
2016-05-02 22:39:03,704 WARN [New I/O worker #5] org.jboss.netty.channel.DefaultChannelPipeline: An exception was thrown by a user handler while handling an exception event ([id: 0xa8147f0c, /10.250.70.110:48056 => /10.250.70.117:38300] EXCEPTION: java.lang.OutOfMemoryError: Java heap space)
java.lang.OutOfMemoryError: Java heap space
        at java.nio.HeapByteBuffer.<init>(HeapByteBuffer.java:57)
        at java.nio.ByteBuffer.allocate(ByteBuffer.java:331)
        at org.jboss.netty.buffer.CompositeChannelBuffer.toByteBuffer(CompositeChannelBuffer.java:649)
        at org.jboss.netty.buffer.AbstractChannelBuffer.toByteBuffer(AbstractChannelBuffer.java:530)
        at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.acquire(SocketSendBufferPool.java:77)
        at org.jboss.netty.channel.socket.nio.SocketSendBufferPool.acquire(SocketSendBufferPool.java:46)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.write0(AbstractNioWorker.java:194)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.writeFromTaskLoop(AbstractNioWorker.java:152)
        at org.jboss.netty.channel.socket.nio.AbstractNioChannel$WriteTask.run(AbstractNioChannel.java:335)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.processTaskQueue(AbstractNioSelector.java:366)
        at org.jboss.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:290)
        at org.jboss.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:90)
        at org.jboss.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
2016-05-02 22:39:05,783 ERROR [sparkDriver-akka.actor.default-dispatcher-14] org.apache.spark.rpc.akka.ErrorMonitor: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
        at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
        at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
        at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
        at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
        at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
        at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:843)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
        at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:842)
        at akka.remote.EndpointWriter.writeSend(Endpoint.scala:743)
        at akka.remote.EndpointWriter$$anonfun$2.applyOrElse(Endpoint.scala:718)
        at akka.actor.Actor$class.aroundReceive(Actor.scala:467)
        at akka.remote.EndpointActor.aroundReceive(Endpoint.scala:411)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
        at akka.actor.ActorCell.invoke(ActorCell.scala:487)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
        at akka.dispatch.Mailbox.run(Mailbox.scala:220)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:397)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
2016-05-02 22:39:05,783 ERROR [sparkDriver-akka.actor.default-dispatcher-2] akka.actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-5] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
        at java.util.Arrays.copyOf(Arrays.java:2271)
        at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
        at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
        at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
        at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
        at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
        at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
        at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
        at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)

To be able to help, you should post the code and an explanation of the input data.

Why the data? When aggregating by key, knowing the key distribution as well as its cardinality is important for achieving good parallelism and avoiding trouble.

Let me explain what these are and why they matter. Suppose you aggregate by country: there are roughly 250 countries on Earth, so the cardinality of the key is about 250.

Cardinality matters because low cardinality can kill parallelism. For example, if 90% of your data is for the US and you have 250 nodes, one node ends up processing 90% of the data.

That leads to the notion of distribution: when you group by key, the number of values per key is your value distribution. For the best parallelism you ideally want roughly the same number of values for every key.

Now, if the cardinality of your data is high but the value distribution is not optimal, statistically things should still balance out. For example, suppose you have Apache logs where most users visit only a few pages but some visit very many (as bots do). If the number of users is much larger than the number of nodes, the heavy users get spread across the nodes, so parallelism is not really affected.

Problems usually arise when the key cardinality is low. If on top of that the value distribution is bad, you end up with an unbalanced shuffle.
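A cheap way to check for this kind of skew before tuning memory is to count rows per key on a sample and look at the heaviest key's share. A plain-Scala sketch of the check (in Spark you could obtain the counts with something like rdd.sample(false, 0.01).countByKey(); the names below are illustrative):

```scala
// Sketch: measure key skew from a (sampled) map of per-key row counts.
object SkewCheck {
  // Fraction of all rows held by the single heaviest key; close to 1.0
  // means one partition will do almost all the work.
  def skewRatio(keyCounts: Map[String, Long]): Double = {
    val total = keyCounts.values.sum.toDouble
    keyCounts.values.max / total
  }

  def main(args: Array[String]): Unit = {
    // 90% of rows share one key: a textbook skewed distribution.
    val counts = Map("US" -> 900L, "IN" -> 50L, "DE" -> 50L)
    println(f"heaviest key holds ${skewRatio(counts) * 100}%.0f%% of rows")
  }
}
```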


Last but not least, a lot depends on what you do inside your aggregateByKey. It is easy to exhaust memory if you leak objects in the map or reduce phase of the processing.

So the problem was actually the driver running out of memory, not the executors - hence the errors in the driver logs. Hmm. It was not obvious from the logs, though. The driver ran out because 1) it was using the default -Xmx900m, and 2) the Spark driver relies on the Akka libraries, and Akka relies on the stubborn JavaSerializer, which serializes objects into a byte array rather than a stream. As an interim solution I increased spark.driver.memory to 4096m and have not seen a memory error since. Thanks everyone for the insights into the problem space.
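For reference, the resources described in the question plus the driver-memory fix can be expressed as one submit command. This is a sketch only: the class and jar names are placeholders, and spark.shuffle.memoryFraction is the Spark 1.5-era setting behind the "shuffle fraction from 0.2 to 0.4" mentioned above.

```shell
# Sketch of the submit command with the driver heap raised from the
# ~900m default to 4g - the only change that mattered here.
spark-submit \
  --master yarn \
  --driver-memory 4096m \
  --num-executors 20 \
  --executor-cores 6 \
  --executor-memory 23g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  --conf spark.shuffle.memoryFraction=0.4 \
  --class com.example.MyJob myjob.jar   # placeholder class/jar names
```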

To help you, we would at least need to see some code to understand what you actually do with those 2B rows. - I just added a description of the job and pseudocode; let me know if I should provide more. Thanks. I know the data is not evenly distributed. The cardinality of the final keys, compared with the initial dataset, is about 100:1 - a 1B-row input set down to a 10M output set. Does that make sense? But then I should compare the operations before and after the shuffle where the job fails, right? Also, can you explain why I see these errors in the driver log? Are these driver errors or executor errors? Are they caused by the overall memory given to the executors and/or the driver, or by a limit in Spark's shuffle blocks themselves?
// pseudocode where I use aggregateByKey

case class UserDataSet(salary: Double, members: Int, clicks: Map[Int, Long],
    businesses: Map[Int, Set[Int]], ...) // about 10 fields, 5 of them maps

def main() = {

  // create a combinationRDD of type (String, Set[Set]) from the input dataset,
  //   representing all combinations
  // create a joinedRdd of type (String, UserDataSet) - the key at this point is
  //   already the final key containing 10 unique fields; the value is a UserDataSet

  // This is where things fail
  val finalDataSet = joinedRdd.aggregateByKey(UserDataSet.getInstance())(processDataSeq, processDataMerge)

}

private def processDataMerge(map1: UserDataSet, map2: UserDataSet) = {

  map1.clicks ++= map2.clicks // deep merge, of course, to avoid overwriting map keys
  map1.salary += map2.salary

  map1
}
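A side-effect-free sketch of the "deep merge" the comment hints at, using plain immutable Scala maps. The field types follow the case class above (Map[Int, Long] for clicks, Map[Int, Set[Int]] for businesses), but the helper names are illustrative, not from the original code:

```scala
// Sketch: deep merge for map-typed fields, so that colliding keys have
// their values combined instead of the second map overwriting the first.
object DeepMerge {
  // clicks-style field: sum the counts for keys present in both maps.
  def mergeCounts(a: Map[Int, Long], b: Map[Int, Long]): Map[Int, Long] =
    b.foldLeft(a) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, 0L) + v)
    }

  // businesses-style field: union the sets for keys present in both maps.
  def mergeSets(a: Map[Int, Set[Int]], b: Map[Int, Set[Int]]): Map[Int, Set[Int]] =
    b.foldLeft(a) { case (acc, (k, s)) =>
      acc.updated(k, acc.getOrElse(k, Set.empty[Int]) ++ s)
    }

  def main(args: Array[String]): Unit = {
    val merged = mergeCounts(Map(1 -> 2L, 2 -> 1L), Map(1 -> 3L, 3 -> 5L))
    assert(merged == Map(1 -> 5L, 2 -> 1L, 3 -> 5L)) // key 1 summed, not overwritten
  }
}
```

Returning fresh immutable maps also avoids any question of whether the shared zero value from UserDataSet.getInstance() gets mutated across partitions.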