Apache Spark GraphX java.lang.ArrayIndexOutOfBoundsException

I'm trying to understand how to work with Spark GraphX, but I keep running into problems, so perhaps someone can suggest what to read, etc. I have tried the Spark documentation and the Learning Spark book (O'Reilly Media), but could not find any explanation of how much memory is needed to process networks of different sizes, and so on.

For my tests I used several sample datasets. I run them from the Spark shell on 1 master node (~16 GB RAM):

./bin/spark-shell --master spark://192.168.0.12:7077 --executor-memory 2900m --driver-memory 10g
and 3-5 workers (1 worker per separate machine, each with 4 GB RAM):
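Each worker was started with something like Spark's standard standalone launch command (the exact flags here are illustrative, not copied from my setup):

./bin/spark-class org.apache.spark.deploy.worker.Worker spark://192.168.0.12:7077 --memory 4g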

Then I run Scala scripts (not compiled) from the Spark shell:
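I load them with the REPL's :load command, e.g. (the path is illustrative):

:load /path/to/my_script.scala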

I am not using HDFS yet; I simply copied the dataset files to each machine (with the same path names, of course). On small networks like Zachary's karate club, or on larger ones up to ~256 MB (after increasing the driver-memory parameter), I was able to count triangles, wedges, etc.

Now I am trying to process a 750+ MB network and I get errors. For example, I have the Wikipedia links dataset in two-column format (link_from link_to), 750 MB. Trying to load it:

val graph = GraphLoader.edgeListFile(sc, "graphx/data/dbpidia")
and I get an error:

[Stage 0:==============================================>     (22 + 1) / 23]
15/04/30 22:52:46 WARN TaskSetManager: Lost task 22.0 in stage 0.0 (TID 22, host-192-168-0-18.openstacklocal): java.lang.ArrayIndexOutOfBoundsException: 1
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:83)
at org.apache.spark.graphx.GraphLoader$$anonfun$1$$anonfun$apply$1.apply(GraphLoader.scala:76)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:76)
at org.apache.spark.graphx.GraphLoader$$anonfun$1.apply(GraphLoader.scala:74)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:631)
at org.apache.spark.rdd.RDD$$anonfun$15.apply(RDD.scala:631)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:245)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/04/30 22:52:47 WARN TaskSetManager: Lost task 22.2 in stage 0.0 (TID 24, host-192-168-0-18.openstacklocal): java.lang.ArrayIndexOutOfBoundsException
In fact, I need to process datasets larger than 1 TB, but I get errors even on smaller ones. What am I doing wrong? What are the memory limits? What strategy would you suggest for >1 TB files, and how is it better to store them?

Thanks.

This may be a bug in GraphX.


I have the same problem as you. It works fine on small datasets, but when the data size grows, Spark throws an ArrayIndexOutOfBoundsException error.

Is the dataset publicly available? It looks like there may be malformed vertex references in the data. Have you tried loading the file into an RDD[(String, String)] and then parsing it into the edges yourself?
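A minimal sketch of that approach (assuming whitespace-separated columns, as in your format; the point is to skip malformed lines, since GraphLoader.edgeListFile throws when a line does not split into two tokens):

import org.apache.spark.graphx.{Edge, Graph}

// Parse the edge list manually so malformed lines can be skipped
// instead of crashing the whole load.
val edges = sc.textFile("graphx/data/dbpidia")
  .map(_.trim)
  .filter(line => line.nonEmpty && !line.startsWith("#")) // drop blanks and comments
  .map(_.split("\\s+"))
  .filter(_.length >= 2)                                  // skip lines without two columns
  .map(cols => Edge(cols(0).toLong, cols(1).toLong, 1))

// defaultValue is the attribute assigned to every vertex.
val graph = Graph.fromEdges(edges, defaultValue = 1)

Compared with GraphLoader.edgeListFile, this trades some loading speed for tolerance of dirty input, and it would also tell you how many bad lines the file contains.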