Scala: How to filter a mixed-node graph on neighbor vertex types_Scala_Graph_Spark Graphx


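This question is about Spark GraphX. I want to compute a subgraph by removing nodes that are adjacent to certain other nodes.

Example

[Task] Keep the A nodes and the B nodes that are not neighbors of a C2 node.

Input graph: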

                    ┌────┐
              ┌─────│ A  │──────┐
              │     └────┘      │
              v                 v
┌────┐     ┌────┐            ┌────┐     ┌────┐
│ C1 │────>│ B  │            │ B  │<────│ C2 │
└────┘     └────┘            └────┘     └────┘
              ^                 ^
              │     ┌────┐      │
              └─────│ A  │──────┘
                    └────┘

Output graph:

         ┌────┐
   ┌─────│ A  │
   │     └────┘
   v           
┌────┐         
│ B  │         
└────┘         
   ^           
   │     ┌────┐
   └─────│ A  │
         └────┘
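
For experimenting with the snippets in this thread, the input graph drawn above could be built roughly as follows. This is only a sketch: the vertex ids, the Long edge attribute, and the local SparkContext settings are assumptions and are not part of the original question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Local context for experimentation; getOrCreate reuses an existing one (e.g. in spark-shell).
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("mixed-node-graph"))

// Vertex ids are arbitrary (assumed); the label is stored as the vertex attribute.
val vertices: RDD[(VertexId, String)] = sc.parallelize(Seq(
  (1L, "A"), (2L, "A"),
  (3L, "B"), (4L, "B"),
  (5L, "C1"), (6L, "C2")))

// The edge attribute is an unused placeholder (assumed to be a Long).
val edges: RDD[Edge[Long]] = sc.parallelize(Seq(
  Edge(1L, 3L, 0L), Edge(1L, 4L, 0L), // first A -> both B's
  Edge(2L, 3L, 0L), Edge(2L, 4L, 0L), // second A -> both B's
  Edge(5L, 3L, 0L),                   // C1 -> left B
  Edge(6L, 4L, 0L)))                  // C2 -> right B

val graph: Graph[String, Long] = Graph(vertices, edges)
```

With these ids, the expected output graph consists of vertices 1, 2 and 3 and the two A -> B edges that end in vertex 3.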

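How can I elegantly write a GraphX query that returns the output graph?

One solution is to use the triplets view to identify the subset of B nodes that are neighbors of C1 nodes. Next, union those nodes with the A nodes. Finally, create a new graph: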

// Step 1
// Compute the subset of B's that are neighbors with C1
val nodesBC1 = graph.triplets .
    filter {trip => trip.srcAttr == "C1"} .
    map {trip => (trip.dstId, trip.dstAttr)}

// Step 2    
// Union the subset B's with all the A's
val nodesAB = nodesBC1 .
    union(graph.vertices filter {case (id, label) => label == "A"})

// Step 3
// Create a graph using the subset nodes and all the original edges
// Remove nodes that have null values
val solution1 = Graph(nodesAB, graph.edges) .
    subgraph(vpred = {case(id, label) => label != null})

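In Step 1, I recreate the node RDD (containing the B nodes) by mapping the dstId and dstAttr of the triplets view together. I'm not sure how efficient this is for large graphs.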

Use GraphOps.collectNeighbors:

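import org.apache.spark.graphx.EdgeDirection

// Ids of vertices with no C2 neighbor, intersected with the ids of all non-C vertices
val nodesAB = graph.collectNeighbors(EdgeDirection.Either)
  .filter{case (vid,ns) => ! ns.map(_._2).contains("C2")}.map(_._1)
  .intersection(
    graph.vertices
      .filter{case (vid,attr) => ! attr.toString.startsWith("C") }.map(_._1)
  )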

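The rest works the same way as yours:

// nodesAB holds only vertex ids, so the labels are joined back in first
val labeledAB = graph.vertices.join(nodesAB.map(id => (id, ()))).mapValues(_._1)
val solution1 = Graph(labeledAB, graph.edges) .
    subgraph(vpred = {case(id, label) => label != null})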

If you want to use DataFrames, which are more scalable, we first need to convert nodesAB into a DataFrame:

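import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// createDataFrame with an explicit schema expects an RDD[Row], so wrap each id in a Row
val newNodes = sqlContext.createDataFrame(
  nodesAB.map(Row(_)),
  StructType(Array(StructField("newNode", LongType, false)))
)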

Then create a DataFrame of the edges like this:

val edgeDf = sqlContext.createDataFrame(
  graph.edges.map{edge => Row(edge.srcId, edge.dstId, edge.attr)}, 
  StructType(Array(
    StructField("srcId", LongType, false),
    StructField("dstId", LongType, false),
    StructField("attr", LongType, false)
  ))
)

Then you can do this to create the graph without calling subgraph:

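import sqlContext.implicits._  // enables the $"col" syntax

val solution1 = Graph(
  // nodesAB holds only vertex ids, so the labels are joined back in
  graph.vertices.join(nodesAB.map(id => (id, ()))).mapValues(_._1),
  edgeDf
  .join(newNodes, $"srcId" === $"newNode").select($"srcId", $"dstId", $"attr")
  .join(newNodes, $"dstId" === $"newNode")
  .rdd.map(row => Edge(row.getLong(0), row.getLong(1), row.getLong(2)))
)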

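Here is another solution. It uses aggregateMessages to send an integer (1) to the B's that should be removed from the graph. The resulting vertex set is joined with the graph, and a subsequent subgraph call removes the unwanted B's from the output graph.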

// Step 1: send the message (1) to vertices that should be removed   
val deleteMe = graph.aggregateMessages[Int](
    ctx => {
      if (ctx.dstAttr.equals("B") && ctx.srcAttr.equals("C2")) { // "C2" is the label used in the example
        ctx.sendToDst(1) // 1 means delete, but number is not actually used
      }
    },
    (a,b) => a  // choose either message, they are all (1)
  )

  // Step 2: join vertex sets, original and deleteMe
  val joined = graph.outerJoinVertices(deleteMe) {
    (id, origValue, msgValue ) => msgValue match {
      case Some(number) => "deleteme"  // vertex received msg
      case None => origValue
    }
  }

  // Step 3: Remove nodes marked "deleteme" (vpred keeps the vertices for which it returns true)
  joined.subgraph(vpred = (id, data) => !data.equals("deleteme"))

I was thinking about an approach that uses only one intermediate deletion flag, e.g. "deleteme", instead of using both 1 and "deleteme". But so far this is a good approach.
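
One possible way to get by with a single flag, sketched here under assumptions: the message itself is a Boolean that is carried next to the label, used for pruning, and then dropped again, so no "deleteme" label is needed. The example graph (vertex ids, Long edge attribute) and the names flagged and pruned are hypothetical. Unlike the answer above, this sketch also drops the C vertices so that the result matches the output graph in the question.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexRDD}

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("single-flag-sketch"))

// Example graph from the question; ids and edge attributes are assumed.
val graph: Graph[String, Long] = Graph(
  sc.parallelize(Seq((1L, "A"), (2L, "A"), (3L, "B"), (4L, "B"), (5L, "C1"), (6L, "C2"))),
  sc.parallelize(Seq(
    Edge(1L, 3L, 0L), Edge(1L, 4L, 0L), Edge(2L, 3L, 0L), Edge(2L, 4L, 0L),
    Edge(5L, 3L, 0L), Edge(6L, 4L, 0L))))

// Flag every B that has an incoming edge from a C2; the flag is the only marker used.
val flagged: VertexRDD[Boolean] = graph.aggregateMessages[Boolean](
  ctx => if (ctx.srcAttr == "C2" && ctx.dstAttr == "B") ctx.sendToDst(true),
  _ || _)

// Carry the flag next to the label, prune on it, then restore the plain label.
val pruned: Graph[String, Long] = graph
  .outerJoinVertices(flagged) { (_, label, flag) => (label, flag.getOrElse(false)) }
  .subgraph(vpred = (_, v) => !v._2 && !v._1.startsWith("C"))
  .mapVertices((_, v) => v._1)
```

Here the direction of the C2 -> B edge is taken from the diagram; if edges could point either way, the send function would also need to check the reverse case.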

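Does Edge.attr hold anything useful? After playing with this for a few hours, I'm not sure there is a better way to identify the edges to delete than what you are doing now: letting the edges create vertices with an attr value of null and then using subgraph for the final pruning. At least not without using DataFrames or a big RDD.cartesian.

Cool. Thanks for trying @DavidGriffin :-)

I like your solution because it uses collectNeighbors. Thanks.

This was surprisingly hard to do. One thing I might do differently: I recently learned about RDD.cogroup. collectNeighbors only returns the vertex id of the main node, not its attr. If I used cogroup to add the vertex attribute, I might be able to avoid the intersection in the code. Then I could filter out startsWith("C") in the first filter.

In the end I used the third approach. Using aggregateMessages, I send a "delete me" message to all dst vertices that should be removed. Then I filter those vertices out of the graph with 1) an outerJoinVertices step and 2) a subgraph step.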
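
The idea from the comments of attaching the vertex attribute to the collectNeighbors result, so that the intersection is no longer needed, might look roughly like the sketch below. Note that it uses a plain pair-RDD join rather than cogroup, which is a substitution made here for brevity. The example graph and the names neighbors, keep and solution are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId}
import org.apache.spark.rdd.RDD

val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("neighbor-join-sketch"))

// Example graph from the question; ids and edge attributes are assumed.
val graph: Graph[String, Long] = Graph(
  sc.parallelize(Seq((1L, "A"), (2L, "A"), (3L, "B"), (4L, "B"), (5L, "C1"), (6L, "C2"))),
  sc.parallelize(Seq(
    Edge(1L, 3L, 0L), Edge(1L, 4L, 0L), Edge(2L, 3L, 0L), Edge(2L, 4L, 0L),
    Edge(5L, 3L, 0L), Edge(6L, 4L, 0L))))

// For every vertex: the (id, label) pairs of its neighbors in either direction.
val neighbors = graph.collectNeighbors(EdgeDirection.Either)

// Join each vertex's own label with its neighbor list, then apply both filters at once.
val keep: RDD[(VertexId, String)] = graph.vertices
  .join(neighbors)
  .filter { case (_, (label, ns)) => !label.startsWith("C") && !ns.exists(_._2 == "C2") }
  .mapValues(_._1)

// Vertices referenced by edges but missing from keep get a null label and are pruned.
val solution: Graph[String, Long] = Graph(keep, graph.edges)
  .subgraph(vpred = (_, label) => label != null)
```

Because keep already carries the labels, it can be passed to Graph directly as the vertex RDD.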