Scala: Mapping node IDs to edges in GraphX


The code below gives me the nodes for GraphX:

scala> val idNode = cleanwords.flatMap(x=>x).distinct.zipWithIndex.map{case (k, v) => (k, v.toLong)}
idNode: org.apache.spark.rdd.RDD[(String, Long)] = MapPartitionsRDD[185] at map at <console>:32

scala> idNode.take(5)
res97: Array[(String, Long)] = Array((cyber crimes,0), (cyber security,1), (india,2), (review,3), (civil society,4))
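The error implies that there are duplicate values associated with a key, which is why I get an iterator, even though I have already run distinct. So how do I get rid of them now? Also, the solution above does not scale to larger datasets, because I am using collect here.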

Another option is:

val edges2: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (idNode.filter(_._1 == x).collect.toMap.values, idNode.filter(_._1 == y).collect.toMap.values)}
This does not work either.

Could anyone please suggest how to build these nodes and edges so that I can construct a graph in GraphX? The Spark version I am using is 2.1.0.
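
One possible way to avoid both collect and the nested idNode.filter call is to join the edge list against idNode twice, once per endpoint. Spark does not allow RDD operations to be invoked inside another RDD's transformation, which is a likely reason the filter-inside-map attempt fails. The sketch below is not from the original post: the object name EdgeMapping, the helper toIdEdges, and the sample data are hypothetical, and it assumes edgeList is an RDD[Array[String]] whose arrays always hold exactly two node names.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object EdgeMapping {

  // Replace the two node names of every edge with their numeric IDs.
  // Assumption: each edge is a two-element Array(source, destination);
  // any other length would raise a MatchError in the first map.
  def toIdEdges(
      edgeList: RDD[Array[String]],
      idNode: RDD[(String, Long)]): RDD[(VertexId, VertexId)] =
    edgeList
      .map { case Array(src, dst) => (src, dst) }          // key by source name
      .join(idNode)                                        // (src, (dst, srcId))
      .map { case (_, (dst, srcId)) => (dst, srcId) }      // re-key by destination name
      .join(idNode)                                        // (dst, (srcId, dstId))
      .map { case (_, (srcId, dstId)) => (srcId, dstId) }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("edge-mapping")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical stand-in for the question's data.
    val edgeList: RDD[Array[String]] = sc.parallelize(Seq(
      Array("cyber crimes", "india"),
      Array("cyber security", "india")))

    // zipWithIndex already yields Long indices.
    val idNode: RDD[(String, Long)] =
      edgeList.flatMap(_.toSeq).distinct.zipWithIndex

    val edges = toIdEdges(edgeList, idNode)

    // Vertices are (id, name) pairs; every edge gets a placeholder attribute of 1.
    val graph = Graph(idNode.map(_.swap), edges.map { case (s, d) => Edge(s, d, 1) })
    println(s"vertices=${graph.vertices.count}, edges=${graph.edges.count}")

    spark.stop()
  }
}
```

Because join is an inner join, an edge whose endpoint is missing from idNode is silently dropped. If the node names were not unique in idNode, the join would emit one edge per matching ID, so the distinct step before zipWithIndex matters.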


Update

I was able to find a fix for the non-scalable solution:

scala> val edges: RDD[(VertexId, VertexId)] = edgeList.map{case Array(x: String, y: String) => (t1(x), t1(y))} 
Use t1(x) instead of t1.values(x) to resolve the error.
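
To illustrate the difference, here is a small plain-Scala sketch. The definition of t1 is not shown in the post, so this assumes it is a driver-side Map[String, Long] from node name to ID (for example, something built from the collected idNode). The object name LookupDemo and the sample entries are hypothetical.

```scala
object LookupDemo {
  // Assumption: t1 maps each node name to its numeric ID.
  val t1: Map[String, Long] = Map(
    "cyber crimes"   -> 0L,
    "cyber security" -> 1L,
    "india"          -> 2L)

  // apply looks up a single key and returns one Long,
  // which is what a (VertexId, VertexId) edge tuple needs.
  val id: Long = t1("india")

  // values ignores the keys and returns every ID as a collection,
  // so it cannot stand in for a single VertexId.
  val all: Iterable[Long] = t1.values

  // get returns an Option, avoiding an exception for unknown names.
  val missing: Option[Long] = t1.get("not a node")
}
```

Note that t1(x) throws a NoSuchElementException when x is not a key, so t1.get or t1.getOrElse may be a safer choice if the edge list can contain names that were filtered out of the node set.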
