Adding a new vertex to a graph in Spark using Scala
I am using Spark with Scala. I want to create a graph and then update it dynamically. I have done this with the following code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object firstgraph {
  def addVertex(
      sc: SparkContext,
      vertexRDD: RDD[(Long, (String, Int))],
      name: String,
      age: Int,
      counter: Long): RDD[(Long, (String, Int))] = {
    val newVertexArray = Array((counter, (name, age)))
    val newVertexRdd: RDD[(Long, (String, Int))] = sc.parallelize(newVertexArray)
    newVertexRdd ++ vertexRDD
  }
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("firstgraph")
    val sc = new SparkContext(conf)
    val vertexArray = Array(
      (1L, ("Alice", 28)),
      (2L, ("Bob", 27)),
      (3L, ("Charlie", 65)),
      (4L, ("David", 42)),
      (5L, ("Ed", 55)),
      (6L, ("Fran", 50)))
    val edgeArray = Array(
      Edge(2L, 1L, 7),
      Edge(2L, 4L, 2),
      Edge(3L, 2L, 4),
      Edge(3L, 6L, 3),
      Edge(4L, 1L, 1),
      Edge(5L, 2L, 2),
      Edge(5L, 3L, 8),
      Edge(5L, 6L, 3))
    var vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
    var edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
    var graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)
    graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
      case (id, (name, age)) => println(s"$name is $age")
    }
    var x = 0
    var counter = 7L
    var name = ""
    var age = 0
    while (x == 0) {
      println("Enter name")
      name = scala.io.StdIn.readLine()
      println("Enter age")
      age = scala.io.StdIn.readInt()
      vertexRDD = addVertex(sc, vertexRDD, name, age, counter)
      graph = Graph(vertexRDD, edgeRDD)
      graph.vertices.filter { case (id, (name, age)) => age > 30 }.collect.foreach {
        case (id, (name, age)) => println(s"$name is $age")
      }
      counter = counter + 1
      println("Want to add another node? Enter 0 for yes, 1 for no")
      x = scala.io.StdIn.readInt()
    }
  }
}
This program adds a new vertex to the graph, but the whole graph is recomputed every time a vertex is inserted. I would like to do this without recomputing the graph.

Apache Spark RDDs are not designed for fine-grained updates: every operation on an RDD transforms the RDD as a whole.
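To make that point concrete, here is a minimal sketch (assuming an already-created SparkContext named `sc`, as in the question): there is no in-place "add" on an RDD; a union produces a brand-new RDD whose lineage includes the old one, and evaluating it replays that lineage unless the inputs are cached.

```scala
import org.apache.spark.rdd.RDD

// Assumes `sc` is an existing SparkContext (hypothetical setup).
val vertices: RDD[(Long, (String, Int))] =
  sc.parallelize(Array((1L, ("Alice", 28)), (2L, ("Bob", 27))))

// There is no vertices.add(...): the only option is to build a NEW RDD.
val updated: RDD[(Long, (String, Int))] =
  vertices ++ sc.parallelize(Array((3L, ("Charlie", 65))))

// `vertices` is unchanged. `updated` is a separate, immutable RDD whose
// lineage contains the old one, so every action on it re-evaluates that
// lineage unless the inputs have been persisted.
```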
First, I suggest you rethink your approach and try to use RDDs the way they were designed. Many common algorithms were designed to run on a single machine, quicksort being one example: you cannot implement quicksort on an RDD by swapping two elements at each step. Doing many small operations one at a time wastes the potential of a distributed system; instead, you need to redesign the algorithm to exploit parallelism.
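As a sketch of what "redesigning for parallelism" means (assuming an existing SparkContext `sc`, not from the original answer), you would not port quicksort's pairwise swaps; you would use the RDD API's own distributed sort, which range-partitions the data and sorts each partition locally:

```scala
import org.apache.spark.rdd.RDD

// Assumes `sc` is an existing SparkContext (hypothetical setup).
val nums: RDD[Int] = sc.parallelize(Seq(5, 3, 8, 1, 9, 2))

// Pairwise in-place swaps do not map onto immutable, partitioned data.
// sortBy shuffles records into range partitions, then sorts each
// partition independently — the parallel reformulation of the problem.
val sorted: RDD[Int] = nums.sortBy(identity)
```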
This may not apply to your case, and you may genuinely need point updates, as in your example. If so, you are better off with a different backend: HBase and Cassandra are designed for point updates, as are essentially all SQL and NoSQL databases. If you need a graph database, so is Neo4j.
But before leaving Spark, there is one last option worth checking: an RDD variant designed for point updates, which grew out of the GraphX project, so it may fit your case well. Try the following code to add a set of vertices to an existing graph. Here `inputGraph` is my existing graph; it is declared as a global variable and created before this function is called. This code only adds vertices to it. The `rdd` parameter is my collection; its values are converted to `Long` and used as the vertex IDs added to the graph.
def addVertex(rdd: RDD[String], sc: SparkContext, session: String): Long = {
  val defaultUser = (0, 0)
  rdd.collect().foreach { x =>
    val aVertex: RDD[(VertexId, (Int, Int))] =
      sc.parallelize(Array((x.toLong, (100, 100))))
    gVertices = gVertices.union(aVertex)
  }
  inputGraph = Graph(gVertices, gEdges, defaultUser)
  inputGraph.cache()
  gVertices = inputGraph.vertices
  gVertices.cache()
  val count = gVertices.count
  println(count)
  1L
}

GraphX uses RDDs under the hood. You should call persist, or its alias cache, on them to avoid recomputation.
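A minimal sketch of that advice applied to the question's loop (the names `sc`, `vertices`, and `addVertexCached` are assumptions for illustration, not from the original code): persisting the vertex RDD after each union means later actions read materialized partitions instead of replaying the ever-growing lineage.

```scala
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc` (hypothetical setup).
var vertices: RDD[(VertexId, (String, Int))] =
  sc.parallelize(Array((1L, ("Alice", 28)))).cache()

def addVertexCached(sc: org.apache.spark.SparkContext,
                    id: Long, name: String, age: Int): Unit = {
  val old = vertices
  // The union still builds a new RDD; persist it so subsequent
  // actions reuse its partitions instead of recomputing the lineage.
  vertices = (old ++ sc.parallelize(Array((id, (name, age)))))
    .persist(StorageLevel.MEMORY_ONLY)
  vertices.count()  // materialize the new RDD once
  old.unpersist()   // release the superseded copy
}
```

Rebuilding the `Graph` from the updated RDD is still necessary, but with cached inputs the rebuild is a cheap read rather than a full recomputation from the original arrays.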