Scala: Creating edges from vertices using Spark


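Suppose I have an array of vertices and I want to create edges from them so that each vertex is connected to the next x vertices, where x can be any integer value. Is there a way to do this with Spark?

This is what I have so far in Scala: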

//array that holds the edges
    var edges = Array.empty[Edge[Double]]
    for(j <- 0 to vertices.size - 2) {
      for(i <- 1 to x) {
        if((j+i) < vertices.size) {
          //add edge
          edges = edges ++ Array(Edge(vertices(j)._1, vertices(j+i)._1, 1.0))
          //add inverse edge, we want both directions
          edges = edges ++ Array(Edge(vertices(j+i)._1, vertices(j)._1, 1.0))
        }
      }
    }

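For example, with vertices named hello, world, and, planet and cosmos, I would expect edges such as hello -> and, and -> hello, hello -> planet, planet -> hello, world -> and, and -> world, world -> planet, world -> cosmos, and so on.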

Do you mean something like this?

import org.apache.spark.graphx.Edge

// n is the number of following vertices to connect to (x in the question)
// Add dummy vertices at the end (assumes that you don't use negative ids)
(vertices ++ Array.fill(n)((-1L, null))) 
  .sliding(n + 1) // Slide over n + 1 vertices at a time
  .flatMap(arr => { 
     val (srcId, _) = arr.head // Take first
     // Generate 2n edges
     arr.tail.flatMap{case (dstId, _) => 
       Array(Edge(srcId, dstId, 1.0), Edge(dstId, srcId, 1.0))
     }}.filter(e => e.srcId != -1L && e.dstId != -1L)) // Drop dummies
  .toArray
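To check the sliding-window idea on a small input, here is a self-contained sketch that runs without Spark. The `Edge` case class is a local stand-in that mirrors the fields of GraphX's `org.apache.spark.graphx.Edge`. The sample vertices and `n = 3` are assumptions chosen only for illustration.

```scala
// Local stand-in for org.apache.spark.graphx.Edge so the sketch runs without Spark.
case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

// Sample data (assumed for illustration).
val vertices: Array[(Long, String)] =
  Array((1L, "hello"), (2L, "world"), (3L, "and"), (4L, "planet"), (5L, "cosmos"))
val n = 3 // each vertex is linked to the next n vertices

// Pad with n dummy vertices so that every real vertex heads one full window.
val padded: Array[(Long, String)] = vertices ++ Array.fill(n)((-1L, null: String))

val edges: Array[Edge[Double]] = padded
  .sliding(n + 1)
  .flatMap { window =>
    val (srcId, _) = window.head
    // One edge in each direction between the window head and every later vertex.
    window.tail.toList.flatMap { case (dstId, _) =>
      List(Edge(srcId, dstId, 1.0), Edge(dstId, srcId, 1.0))
    }
  }
  .filter(e => e.srcId != -1L && e.dstId != -1L) // drop edges that touch a dummy
  .toArray
```

Padding keeps the window size constant, so the last few vertices need no special case. The price is the extra filter pass that discards edges touching a dummy.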

If you want to run this on an RDD, just adjust the initial step as follows:

import org.apache.spark.mllib.rdd.RDDFunctions._

val nPartitions = vertices.partitions.size - 1

vertices.mapPartitionsWithIndex((i, iter) =>
  if (i == nPartitions) (iter ++ Array.fill(n)((-1L, null))).toIterator
  else iter)

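and, of course, drop toArray. If you want circular connections (tail connected to head), you can replace Array.fill(n)((-1L, null)) with vertices.take(n) and drop the filter.

I think this will get you what you want:

First, I defined a small helper function, pairwiseEdges (note that I set the edge data to the vertex names here to make visual inspection easier).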
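The body of the helper is not reproduced in this thread. Below is a minimal sketch that would be consistent with the `Edge(1,2,hello--world)` style of the edges listed at the end of this answer. The signature, the `--` separator, and the local `Edge` stand-in (mirroring GraphX's `org.apache.spark.graphx.Edge`) are all assumptions, not the original implementation.

```scala
// Local stand-in for org.apache.spark.graphx.Edge so the sketch runs without Spark.
case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

// Hypothetical helper: emit one edge for every ordered pair of distinct
// vertices in the group, labelled "<source name>--<destination name>".
def pairwiseEdges(group: Iterable[(Long, String)]): List[Edge[String]] =
  (for {
    (srcId, srcName) <- group
    (dstId, dstName) <- group
    if srcId != dstId
  } yield Edge(srcId, dstId, s"$srcName--$dstName")).toList
```

Because both directions are produced for every pair, the result does not depend on the order of the elements in the group. That matters because `groupByKey` makes no ordering guarantee.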
Then I call zipWithIndex on your array to get keys, and convert the array to an RDD:
val vertices = List((1L,"hello"), (2L,"world"), (3L,"and"), (4L, "planet"), (5L,"cosmos")).toArray
val indexedVertices = vertices.zipWithIndex
val rdd = sc.parallelize(indexedVertices)

Then generate the edges with x = 3:
val edges = rdd
  .flatMap{case((vertexId, name), index) => for {i <- 0 to 3; if (index - i) >= 0} yield ((index - i, (vertexId, name)))}
  .groupByKey()
  .flatMap{case(index, iterable) => pairwiseEdges(iterable.toList)}
  .distinct()
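To see why keying each vertex by `index - i` produces windows of neighbouring vertices, the same steps can be mirrored on plain Scala collections. This is only a local sketch: `groupBy` stands in for `groupByKey`, `Edge` is a stand-in for the GraphX class, and `pairwiseEdges` is a hypothetical helper that emits both directions for every pair in a group.

```scala
// Local stand-in for org.apache.spark.graphx.Edge so the sketch runs without Spark.
case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

// Hypothetical helper: every ordered pair of distinct vertices in a group.
def pairwiseEdges(group: List[(Long, String)]): List[Edge[String]] =
  for {
    (srcId, srcName) <- group
    (dstId, dstName) <- group
    if srcId != dstId
  } yield Edge(srcId, dstId, s"$srcName--$dstName")

val x = 3
val indexed =
  List((1L, "hello"), (2L, "world"), (3L, "and"), (4L, "planet"), (5L, "cosmos")).zipWithIndex

val localEdges: List[Edge[String]] = indexed
  .flatMap { case ((vertexId, name), index) =>
    // Register the vertex under its own index and the x preceding indices.
    for { i <- 0 to x; if index - i >= 0 } yield (index - i, (vertexId, name))
  }
  .groupBy(_._1) // local counterpart of groupByKey: key k collects indices k..k+x
  .values
  .flatMap(group => pairwiseEdges(group.map(_._2)))
  .toList
  .distinct // overlapping windows produce the same edge more than once
```

Each key `k` collects the vertices at positions `k` to `k + x`, so any two vertices at most `x` positions apart share at least one group. Overlapping groups repeat edges, which is why the final `distinct` is needed.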

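I hope you don't mind a few suggestions: 1) a for comprehension can cover the first flatMap and the filter, for {i <- 0 to 3; if (index - i) >= 0} yield ((index - i, (vertexId, name))), without needing any mutable data structures; 2) if you decide to shuffle, partitioning with RangePartitioner could be a good idea. It requires an additional pass over the data, but most tuples should already be on the correct partition; 3) you can use zipWithIndex on an RDD, but if the data fits in a local array it probably makes no sense (in that case processing with an RDD gains nothing, but if the OP asks for it... :)); 4) ListBuffer is a GenTraversableOnce, so there is no need for toList.

@zero323 I don't mind the suggestions at all, on the contrary :) I will adjust the code. By the way, I agree that using an RDD seems odd if the data fits in an array, but I was happy to see whether generating the edges with an RDD is actually feasible :)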
Output (edges generated with x = 3):

Edge(1,2,hello--world)
Edge(1,3,hello--and)
Edge(1,4,hello--planet)

Edge(2,1,world--hello)
Edge(2,3,world--and)
Edge(2,4,world--planet)
Edge(2,5,world--cosmos)

Edge(3,1,and--hello)
Edge(3,2,and--world)
Edge(3,4,and--planet)
Edge(3,5,and--cosmos)

Edge(4,1,planet--hello)
Edge(4,2,planet--world)
Edge(4,3,planet--and)
Edge(4,5,planet--cosmos)

Edge(5,2,cosmos--world)
Edge(5,3,cosmos--and)
Edge(5,4,cosmos--planet)