Spark and GraphX: "Joining two VertexPartitions with different indexes is slow" after unpersisting a graph

scala apache-spark

Sorry that the title is inaccurate and verbose. If you understand what I mean, please help me edit it. Thanks.

The code is below. If you run it, you get:

14/06/12 14:33:24 WARN ShippableVertexPartitionOps: Joining two VertexPartitions with different indexes is slow.

But if you comment out graph.unpersistVertices(blocking = false), the warning does not appear. So I am curious why this call changes the index of the Graph object.

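import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object Test {
  def main(args: Array[String]): Unit = {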
    val conf = new SparkConf().setAppName("Test")
      .setMaster("local[4]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(conf)

    val v: RDD[(VertexId, Int)] = sc.parallelize(Seq((0L,0),(1L,1),(2L,2)))
    val e: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(0, 1, 0), Edge(0, 2, 0), Edge(1, 0, 0), Edge(2, 1, 0)))


    val g = Graph(v, e)

    def test(graph: Graph[Int, Int]) = {
      graph.cache()
      val ng = graph.outerJoinVertices(graph.outDegrees){
        // out is an Option[Int]; vertices without outgoing edges get degree 0
        (vid, vd, out) => (vd, out.getOrElse(0))
      }

      val f = ng.subgraph(epred = _.srcId != 0, vpred = (vid, vd) => vid != 0L)
      f.cache()
      graph.unpersistVertices(blocking = false)
      f
    }

    val f1 = test(g)

    println(f1.numVertices)

  }
}

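As far as I know, when you perform an operation on a GraphX graph (such as mapValues), the index of the RDD (the VertexRDD) is reused to avoid recomputing it. When you do something like subgraph, you can still reuse that index to some extent by applying a bitmask to it.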

Does outerJoinVertices do something similar, since it only modifies the values of the RDD?

Also, I cache() the new graph before I unpersist the old one, so I thought the unpersist would not affect the new graph because it was already cached. But I was wrong.

How do cache and unpersist work? Since I am not actually joining partitions myself, why do they affect the index?

Update: I looked at the code, and numVertices is actually a map followed by a reduce: partitionsRDD.map(_.size).reduce(_ + _). So the join actually happens at this line.

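You need to materialize the new graph before unpersisting the old one. RDD transformations are lazy: Spark does not actually compute them until it sees an action. Calling cache() only marks the graph for caching, so when the old graph is unpersisted nothing of the new graph has been computed yet. For more information, see "RDD Operations" in the Spark Programming Guide.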
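The laziness described here can be observed with a small sketch. It is not from the original answer: EvalCounter and LazyCacheDemo are hypothetical names, and the counter trick assumes a local master, where tasks run in the same JVM as the driver. The point is only that cache() by itself computes nothing, while the first action does.

```scala
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical helper: a JVM-wide counter. Assumption: local master only,
// so that tasks and the driver share this singleton.
object EvalCounter {
  val calls = new AtomicInteger(0)
}

object LazyCacheDemo {
  // Returns the number of map evaluations seen right after cache()
  // and right after the first action.
  def run(sc: SparkContext): (Int, Int) = {
    EvalCounter.calls.set(0)
    val doubled = sc.parallelize(Seq(1, 2, 3)).map { x =>
      EvalCounter.calls.incrementAndGet() // records each real evaluation
      x * 2
    }
    doubled.cache() // only marks the RDD for caching; no job runs
    val afterCache = EvalCounter.calls.get
    doubled.count() // action: the map is computed and the result is cached
    val afterAction = EvalCounter.calls.get
    (afterCache, afterAction)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("LazyCacheDemo").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

The first number stays at zero because no action has run when cache() returns, which is the same situation the new graph is in when the old graph is unpersisted too early.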

So in the test function, adding one line right after f.cache() solves the problem:
f.numVertices // this action forces Spark to compute f and cache it
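Put together, the test function with the fix applied might look like the sketch below. It reuses the question's data and logic; the object name MaterializeBeforeUnpersist and the run helper are made up for illustration, and out.getOrElse(0) is assumed to be the intended default for vertices without an out-degree.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

object MaterializeBeforeUnpersist {
  // Same steps as test in the question, except that f is materialized
  // before the vertices of the input graph are unpersisted.
  def test(graph: Graph[Int, Int]): Graph[(Int, Int), Int] = {
    graph.cache()
    val ng = graph.outerJoinVertices(graph.outDegrees) {
      (vid, vd, out) => (vd, out.getOrElse(0))
    }
    val f = ng.subgraph(epred = _.srcId != 0L, vpred = (vid, vd) => vid != 0L)
    f.cache()
    f.numVertices // action: forces Spark to compute f and fill the cache
    graph.unpersistVertices(blocking = false)
    f
  }

  // Hypothetical helper: builds the question's graph and returns the
  // vertex count of the filtered graph.
  def run(sc: SparkContext): Long = {
    val v: RDD[(VertexId, Int)] = sc.parallelize(Seq((0L, 0), (1L, 1), (2L, 2)))
    val e: RDD[Edge[Int]] = sc.parallelize(
      Seq(Edge(0L, 1L, 0), Edge(0L, 2L, 0), Edge(1L, 0L, 0), Edge(2L, 1L, 0)))
    test(Graph(v, e)).numVertices
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MaterializeBeforeUnpersist").setMaster("local[4]")
    val sc = new SparkContext(conf)
    try println(run(sc)) finally sc.stop()
  }
}
```

The only change in ordering is that an action runs between f.cache() and graph.unpersistVertices, so f is already computed from the still-cached vertices when the old graph is released.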