Spark: java.lang.NegativeArraySizeException while trying to load a large graph in GraphX

I am trying to load a large graph (60GB) with GraphX in Spark 1.4.1, running in local mode with 16 threads. The driver memory is set to 500GB in spark-defaults.conf. I am working on a machine with 590341 MB of free memory (as reported by free -m), i.e. roughly 576GB.
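Presumably the setting in spark-defaults.conf looks something like the following (the exact line is not shown in the question, only the value):

   # spark-defaults.conf (assumed; the question only states the 500GB value)
   spark.driver.memory    500g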

Some details about the graph I am trying to load: it is the Friendster graph downloaded from snap.stanford.edu. Its actual size is 30GB; since Friendster is a directed graph, I doubled the edges to create an undirected version (which is why the size doubles: 60 = 2 * 30). I load the graph in Scala using GraphLoader.edgeListFile, as follows:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

def main(args: Array[String]): Unit = {
  // Create the Spark configuration and context (local mode, 16 threads)
  val conf = new SparkConf().setAppName("My App").setMaster("local[16]")
  val sc = new SparkContext(conf)

  val currentDir = System.getProperty("user.dir") // get the current directory
  val edgeFile = "file://" + currentDir + "/friendsterUndirected.txt"

  // Load the edges as a graph, spilling to disk when memory runs out
  val graph = GraphLoader.edgeListFile(sc, edgeFile,
    canonicalOrientation = false,
    numEdgePartitions = 1,
    edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
    vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
}
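For reference, the directed-to-undirected preprocessing mentioned above could be done with a small Spark job like the sketch below (the input/output file names and the whitespace-separated edge format are assumptions; the question does not show the actual preprocessing code):

   // Hypothetical sketch: emit each directed edge in both directions
   // to produce the undirected edge list (doubling the file size).
   val directed = sc.textFile("file://" + currentDir + "/friendsterDirected.txt") // assumed input name
   val undirected = directed.flatMap { line =>
     val parts = line.split("\\s+")
     if (parts.length == 2) Seq(parts(0) + "\t" + parts(1), parts(1) + "\t" + parts(0))
     else Seq.empty[String]
   }
   undirected.saveAsTextFile("file://" + currentDir + "/friendsterUndirectedOut") // assumed output path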
The problem is that the graph fails to load with a java.lang.NegativeArraySizeException, as shown below. I tried a smaller graph (~4GB) and it succeeded. Any ideas?

16/01/10 13:04:09 WARN TaskSetManager: Stage 0 contains a task of very large size (440 KB). The maximum recommended task size is 100 KB.
16/01/10 13:32:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NegativeArraySizeException
at java.lang.reflect.Array.newArray(Native Method)
at java.lang.reflect.Array.newInstance(Array.java:75)
at scala.reflect.ClassTag$class.newArray(ClassTag.scala:62)
at scala.reflect.ClassTag$$anon$1.newArray(ClassTag.scala:144)
at org.apache.spark.util.collection.PrimitiveVector.copyArrayWithLength(PrimitiveVector.scala:87)
at org.apache.spark.util.collection.PrimitiveVector.resize(PrimitiveVector.scala:74)
at org.apache.spark.util.collection.PrimitiveVector.$plus$eq(PrimitiveVector.scala:41)
at org.apache.spark.graphx.impl.EdgePartitionBuilder$mcI$sp.add$mcI$sp(EdgePartitionBuilder.scala:34)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:87)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:76)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:76)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:74)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

16/01/10 13:32:56 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NegativeArraySizeException
at java.lang.reflect.Array.newArray(Native Method)
at java.lang.reflect.Array.newInstance(Array.java:75)
at scala.reflect.ClassTag$class.newArray(ClassTag.scala:62)
at scala.reflect.ClassTag$$anon$1.newArray(ClassTag.scala:144)
at org.apache.spark.util.collection.PrimitiveVector.copyArrayWithLength(PrimitiveVector.scala:87)
at org.apache.spark.util.collection.PrimitiveVector.resize(PrimitiveVector.scala:74)
at org.apache.spark.util.collection.PrimitiveVector.$plus$eq(PrimitiveVector.scala:41)
at org.apache.spark.graphx.impl.EdgePartitionBuilder$mcI$sp.add$mcI$sp(EdgePartitionBuilder.scala:34)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:87)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:76)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:76)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:74)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

16/01/10 13:32:56 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): java.lang.NegativeArraySizeException
at java.lang.reflect.Array.newArray(Native Method)
at java.lang.reflect.Array.newInstance(Array.java:75)
at scala.reflect.ClassTag$class.newArray(ClassTag.scala:62)
at scala.reflect.ClassTag$$anon$1.newArray(ClassTag.scala:144)
at org.apache.spark.util.collection.PrimitiveVector.copyArrayWithLength(PrimitiveVector.scala:87)
at org.apache.spark.util.collection.PrimitiveVector.resize(PrimitiveVector.scala:74)
at org.apache.spark.util.collection.PrimitiveVector.$plus$eq(PrimitiveVector.scala:41)
at org.apache.spark.graphx.impl.EdgePartitionBuilder$mcI$sp.add$mcI$sp(EdgePartitionBuilder.scala:34)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:87)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:76)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:76)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:74)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$18.apply(RDD.scala:703)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:69)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:242)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1273)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1264)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1263)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1263)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1457)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1418)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

I don't know exactly what is going on here, but the "negative array size" makes me suspect integer arithmetic overflow: JVM arrays are indexed by Int, so any size computation that exceeds Int.MaxValue (2,147,483,647) wraps around to a negative number. Is there an implicit assumption somewhere that the array is small enough that computing its size cannot overflow? Sorry I can't be more specific. I have seen similar posts with the same problem, and there is an open GraphX bug:
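A minimal sketch of the overflow pattern being suspected (hypothetical, but it mirrors the capacity doubling in Spark's PrimitiveVector.resize visible in the stack trace):

   // Capacity doubling with 32-bit Ints: once the capacity reaches 2^30,
   // doubling it overflows past Int.MaxValue and wraps to a negative value.
   object OverflowDemo {
     def main(args: Array[String]): Unit = {
       val capacity: Int = 1 << 30        // 1,073,741,824 elements
       val doubled: Int = capacity * 2    // wraps to -2,147,483,648
       println(doubled)
       new Array[Int](doubled)            // throws java.lang.NegativeArraySizeException
     }
   }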