Counting vertices on a Titan graph with SparkGraphComputer throws org.apache.spark.SparkException: Job aborted due to stage failure
When trying to use SparkGraphComputer to count the vertices of a Titan graph on a cluster, I get an error that I don't know how to deal with. My code uses TinkerPop 3.1.1-incubating and Titan 1.1.0-SNAPSHOT, and the cluster has DataStax Community Edition 2.1.11 and spark-1.5.2-bin-hadoop2.6 installed. I have put together a minimal Java example to reproduce the problem:
private void strippedDown() {
    // a normal titan cluster
    String titanClusterConfig = "titan-cassandra-test-cluster.properties";
    // a hadoop graph with cassandra as input and gryo as output
    String sparkClusterConfig = "titan-cassandra-test-spark.properties";
    String edgeLabel = "blank";

    // add a graph
    int n = 100;
    Graph titanGraph = GraphFactory.open(titanClusterConfig);
    Vertex superNode = titanGraph.addVertex(T.label, String.valueOf(0));
    for (int i = 1; i < n; i++) {
        Vertex currentNode = titanGraph.addVertex(T.label, String.valueOf(i));
        currentNode.addEdge(edgeLabel, superNode);
    }
    titanGraph.tx().commit();

    // count with titan
    Long count = titanGraph.traversal().V().count().next();
    System.out.println("The number of vertices in the graph is: " + count);

    // count the graph using titan graph computer
    count = titanGraph.traversal(GraphTraversalSource.computer(FulgoraGraphComputer.class)).V().count().next();
    System.out.println("The number of vertices in the graph is: " + count);

    // count the graph using spark graph computer
    Graph sparkGraph = GraphFactory.open(sparkClusterConfig);
    count = sparkGraph.traversal(GraphTraversalSource.computer(SparkGraphComputer.class)).V().count().next();
    System.out.println("The number of vertices in the graph is: " + count);
}
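This throws org.apache.thrift.protocol.TProtocolException: Required field 'keyspace' was not present! Struct: set_keyspace_args(keyspace:null) twice, but the job still completes and returns an incorrect count of 0.
I know this has come up on the mailing list, but I am having trouble understanding it or resolving the problem. Can someone explain what is happening and how to fix it? I have pasted my configuration below.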
gremlin.graph=com.thinkaurelius.titan.core.TitanFactory
storage.backend=cassandrathrift
storage.hostname=node1
storage.cassandra.keyspace=mindmapstest
storage.cassandra.replication-factor=3
cache.db-cache = true
cache.db-cache-clean-wait = 20
cache.db-cache-time = 180000
cache.db-cache-size = 0.5
and
Edit: this script reproduces the error:
graph = TitanFactory.open('titan-cassandra-test-cluster.properties')
superNode = graph.addVertex(T.label,"0")
for (i in 1..100) {
    currentNode = graph.addVertex(T.label, i.toString())
    currentNode.addEdge("blank", superNode)
}
graph.tx().commit()
graph.traversal().V().count()
graph.traversal(computer()).V().count()
sparkGraph = GraphFactory.open('titan-cassandra-test-spark.properties')
sparkGraph.traversal(computer(SparkGraphComputer)).V().count()
Try adding these to your HadoopGraph configuration:
#
# Titan Cassandra InputFormat configuration
# see https://github.com/thinkaurelius/titan/blob/titan11/titan-hadoop-parent/titan-hadoop-core/src/main/java/com/thinkaurelius/titan/hadoop/formats/cassandra/CassandraBinaryInputFormat.java
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.hostname=node1,node2,node3
titanmr.ioformat.conf.storage.cassandra.keyspace=titan
titanmr.ioformat.cf-name=edgestore
#
# Apache Cassandra InputFormat configuration
# see https://github.com/apache/cassandra/blob/cassandra-2.2.3/src/java/org/apache/cassandra/hadoop/ConfigHelper.java
# see https://github.com/thinkaurelius/titan/blob/titan11/titan-hadoop-parent/titan-hadoop-core/src/main/java/com/thinkaurelius/titan/hadoop/formats/cassandra/CassandraBinaryInputFormat.java
# not clear why these need to be set manually when using cassandrathrift
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=titan
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
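I am not 100% sure that the reasons I give here are correct, but I have managed to solve the problem. It stemmed from three underlying issues, all of them related to configuration.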
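1) The first problem was kindly solved by Jason. It came down to using the correct configuration options for connecting to Cassandra. I am still curious about what exactly they do.
2) The reason I could not run the Java code successfully was that I had not set the HADOOP_GREMLIN_LIBS environment variable correctly. Because of this, the jars were not being deployed to the cluster for use by the graph computer. Once it was set, the Gremlin console and the Java example hit the same problem: both returned a count of zero.
3) The count of zero was the hardest to solve, and again it was a case of not understanding the manual. There are many references to having Hadoop installed on the cluster, but nothing about how to connect to Hadoop on the cluster. That requires an extra configuration option, fs.defaultFS, which tells Gremlin where to find the Hadoop filesystem on the cluster. Once this was set correctly, the vertex count was correct.
My theory is that the computation was executed correctly, but when the counts from the Spark workers were reduced, the result was persisted somewhere on the cluster. When the answer was then returned to the console, the local filesystem was consulted instead, nothing was found there, and zero was returned. This may be a bug.
Anyway, the final configuration file I needed was: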
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=com.thinkaurelius.titan.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=/test/output
####################################
# Cassandra Cluster Config #
####################################
titanmr.ioformat.conf.storage.backend=cassandrathrift
titanmr.ioformat.conf.storage.cassandra.keyspace=mindmapstest
titanmr.ioformat.conf.storage.hostname=node1,node2,node3
titanmr.ioformat.cf-name=edgestore
####################################
# SparkGraphComputer Configuration #
####################################
spark.master=spark://node1:7077
spark.executor.memory=1g
spark.serializer=org.apache.spark.serializer.KryoSerializer
####################################
# Apache Cassandra InputFormat configuration
####################################
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=mindmapstest
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
spark.eventLog.enabled=true
####################################
# Hadoop Cluster configuration #
####################################
fs.defaultFS=hdfs://node1:9000
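The two configuration mistakes that produced the silent zero count (mismatched keyspaces and a missing HDFS location) can be caught before submitting a job. The following is only a sketch using plain java.util.Properties; the class name SparkConfigCheck and its methods are hypothetical and are not part of Titan or TinkerPop. It also assumes that fs.defaultFS is supplied through the properties file, which is not required if the Hadoop configuration directory is already on the classpath.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper, not part of Titan or TinkerPop.
public class SparkConfigCheck {

    /** Returns human-readable findings; an empty list means both checks passed. */
    static List<String> check(Properties p) {
        List<String> findings = new ArrayList<>();

        // The Titan input format and the Cassandra input format must read the same keyspace.
        String titanKs = p.getProperty("titanmr.ioformat.conf.storage.cassandra.keyspace");
        String inputKs = p.getProperty("cassandra.input.keyspace");
        if (titanKs == null || inputKs == null) {
            findings.add("keyspace not set for both titanmr.ioformat.conf and cassandra.input");
        } else if (!titanKs.equals(inputKs)) {
            findings.add("keyspace mismatch: " + titanKs + " vs " + inputKs);
        }

        // Without fs.defaultFS (or a Hadoop conf dir on the classpath) the result
        // may be looked up on the local filesystem instead of HDFS.
        if (p.getProperty("fs.defaultFS") == null) {
            findings.add("fs.defaultFS not set in this file");
        }
        return findings;
    }

    /** Parses properties-file text; wraps IOException so callers need no try/catch. */
    static Properties parse(String text) {
        Properties p = new Properties();
        try (Reader r = new StringReader(text)) {
            p.load(r);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) throws IOException {
        String file = args.length > 0 ? args[0] : "titan-cassandra-test-spark.properties";
        String text = new String(Files.readAllBytes(Paths.get(file)));
        List<String> findings = check(parse(text));
        if (findings.isEmpty()) {
            System.out.println("No problems found in " + file);
        } else {
            findings.forEach(System.out::println);
        }
    }
}
```

Run against the final file above, both checks should pass, since the two keyspace properties are both mindmapstest and fs.defaultFS is present.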
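Thanks Jason, everything works now. Can you tell me what those configuration options are doing? Especially the long hash for the predicate…
OK, it looks like it does not actually work. If I submit the script from the edit above to the Gremlin shell, it takes a long time to complete and always returns 0.
Did you make sure to add a cassandra.input.keyspace value that matches titanmr.ioformat.conf.storage.cassandra.keyspace?
Hi Jason, I finally figured it out. Thanks for the extra configuration lines, they were essential. My last problem was the Hadoop configuration. I will explain it in detail in my own answer.
I usually don't include fs.defaultFS in the properties file; instead I include $HADOOP_CONF_DIR in the classpath before starting gremlin.sh. Then, as a further precaution to verify that HDFS is reachable, I run hdfs.ls() after starting the console and check that it lists files from HDFS rather than from the local filesystem. Regarding the cassandra.input.* properties, there may be a bug in Titan that needs fixing, since it does not seem to propagate the right configuration properties. You can refer to the two source files (ConfigHelper.java and CassandraBinaryInputFormat.java) to see what they are doing. In particular, the predicate is a serialized version of a full-range Thrift slice query predicate.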
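To make the predicate value less opaque, it can be decoded by hand. The sketch below is a minimal reader for the Thrift binary protocol that handles only the type codes occurring in this value; the class PredicateDecoder is hypothetical and does not use the Thrift or Cassandra libraries. Mapping the numeric field ids to names is an assumption based on Cassandra's Thrift interface definition, where SlicePredicate field 2 is slice_range and SliceRange fields 1 to 4 are start, finish, reversed and count.

```java
import java.nio.ByteBuffer;

// Hypothetical helper: decodes the hex-encoded predicate without any Thrift dependency.
public class PredicateDecoder {
    // Thrift binary-protocol type codes that occur in this value.
    private static final byte STOP = 0, BOOL = 2, I32 = 8, STRING = 11, STRUCT = 12;

    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    /** Reads fields (type byte, 2-byte field id, value) until the STOP marker. */
    static String readStruct(ByteBuffer buf) {
        StringBuilder sb = new StringBuilder("{");
        while (true) {
            byte type = buf.get();
            if (type == STOP) {
                break;
            }
            short id = buf.getShort();
            if (sb.length() > 1) {
                sb.append(", ");
            }
            sb.append(id).append(": ");
            switch (type) {
                case BOOL:
                    sb.append(buf.get() != 0);
                    break;
                case I32:
                    sb.append(buf.getInt());
                    break;
                case STRING: {
                    // Length-prefixed binary; only the length is printed.
                    byte[] bytes = new byte[buf.getInt()];
                    buf.get(bytes);
                    sb.append("binary[").append(bytes.length).append("]");
                    break;
                }
                case STRUCT:
                    sb.append(readStruct(buf));
                    break;
                default:
                    throw new IllegalArgumentException("unhandled Thrift type " + type);
            }
        }
        return sb.append("}").toString();
    }

    static String decode(String hex) {
        // ByteBuffer is big-endian by default, matching the Thrift binary protocol.
        return readStruct(ByteBuffer.wrap(fromHex(hex)));
    }

    public static void main(String[] args) {
        System.out.println(decode("0c00020b0001000000000b000200000000020003000800047fffffff0000"));
        // prints {2: {1: binary[0], 2: binary[0], 3: false, 4: 2147483647}}
    }
}
```

Read with the field names assumed above, this is a slice range with an empty start, an empty finish, reversed set to false and a count of Integer.MAX_VALUE, which is consistent with the description of a full-range slice query.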