How to convert RDD[(String, Iterable[VertexId])] to a DataFrame in Scala?

scala apache-spark dataframe

I have created an RDD from GraphX as follows:

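import java.util.UUID.randomUUID
import org.apache.spark.graphx.{GraphLoader, VertexId, VertexRDD}
import org.apache.spark.rdd.RDD

val graph = GraphLoader.edgeListFile(spark.sparkContext, fileName)
// Each vertex is paired with the ID of the connected component it belongs to.
val s: VertexRDD[VertexId] = graph.connectedComponents().vertices

// Group vertices by component and label each component with a random UUID.
val nodeGraph: RDD[(String, Iterable[VertexId])] = s.groupBy(_._2) map { case (_, y) =>
  val rand = randomUUID().toString
  val clusterList: Iterable[VertexId] = y.map(_._1)
  (rand, clusterList)
}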

nodeGraph is of type RDD[(String, Iterable[VertexId])], and its data looks like this:

(abc-def11, Iterable(1,2,3,4)), 
(def-aaa, Iterable(10,11)), 
...

What I want to do now is create a DataFrame from it that looks like this:

col1        col2
abc-def11   1
abc-def11   2
abc-def11   3
abc-def11   4
def-aaa     10
def-aaa     11

How can I do this in Spark?

First, convert the RDD to a DataFrame with toDF(), passing the desired column names. This is easiest if you first change Iterable[VertexId] to Seq[Long]:

import spark.implicits._
val df = nodeGraph.map(x => (x._1, x._2.map(_.toLong).toSeq)).toDF("col1", "col2")

Note that this conversion can be done when nodeGraph is created, which saves a step. Next, flatten the DataFrame with the explode function:

import org.apache.spark.sql.functions.explode
val df2 = df.withColumn("col2", explode($"col2"))

This will give you the desired output.
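For anyone who wants to try the two steps end to end, here is a minimal sketch that runs them against the sample data from the question on a local SparkSession. The names ExplodeSketch and toFlatDF are made up for illustration, and plain Long is used in place of VertexId (GraphX defines VertexId as an alias for Long) so the sketch does not need the GraphX dependency.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.explode

// Hypothetical helper object, not part of Spark or of the original answer.
object ExplodeSketch {

  // Long stands in for GraphX's VertexId, which is a type alias for Long.
  def toFlatDF(spark: SparkSession, nodeGraph: RDD[(String, Iterable[Long])]): DataFrame = {
    import spark.implicits._

    // Step 1: Spark has no built-in encoder for Iterable, so convert to Seq first.
    val df = nodeGraph
      .map { case (id, vertices) => (id, vertices.toSeq) }
      .toDF("col1", "col2")

    // Step 2: explode produces one output row per element of the array column.
    df.withColumn("col2", explode($"col2"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("explode-sketch")
      .getOrCreate()

    // Sample data copied from the question.
    val sample: RDD[(String, Iterable[Long])] = spark.sparkContext.parallelize(Seq(
      ("abc-def11", Iterable(1L, 2L, 3L, 4L)),
      ("def-aaa", Iterable(10L, 11L))
    ))

    // orderBy is only here to make the printed row order deterministic.
    toFlatDF(spark, sample).orderBy("col1", "col2").show()

    spark.stop()
  }
}
```

Calling toFlatDF on the real nodeGraph should work the same way, since RDD[(String, Iterable[VertexId])] and RDD[(String, Iterable[Long])] are the same type.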

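I have a question: will .toSeq be a problem? If the Iterable has a billion records, will it fail when toSeq is called?

@Aamir: Seq is an extension of Iterable, so I don't think the conversion will have any performance impact. The reason for the conversion is that Spark does not provide a built-in encoder for Iterable. One can be added using Kryo, but not directly (see).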
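On the toSeq concern, one alternative (not from the answer above, just a sketch) is to flatten on the RDD side with flatMap, so that no Seq column and no explode are needed. The names FlatMapSketch and toFlatDF are hypothetical, and Long again stands in for VertexId. Note that this probably does not remove the memory pressure by itself: as far as I know, groupBy already has to hold all values of one key in memory on a single executor, so a group with a billion entries is a risk before toSeq is ever called.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, shown only to illustrate the flatMap variant.
object FlatMapSketch {

  // Long stands in for GraphX's VertexId, which is a type alias for Long.
  def toFlatDF(spark: SparkSession, nodeGraph: RDD[(String, Iterable[Long])]): DataFrame = {
    import spark.implicits._

    nodeGraph
      // Emit one (id, vertex) pair per element; the iterator keeps this step lazy.
      .flatMap { case (id, vertices) => vertices.iterator.map(v => (id, v)) }
      // (String, Long) has a built-in encoder, so toDF works directly.
      .toDF("col1", "col2")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatmap-sketch")
      .getOrCreate()

    // Sample data copied from the question.
    val sample: RDD[(String, Iterable[Long])] = spark.sparkContext.parallelize(Seq(
      ("abc-def11", Iterable(1L, 2L, 3L, 4L)),
      ("def-aaa", Iterable(10L, 11L))
    ))

    toFlatDF(spark, sample).orderBy("col1", "col2").show()

    spark.stop()
  }
}
```

If very large components are expected, it may be worth avoiding groupBy altogether and attaching the cluster label to each (vertex, component) pair some other way, but that changes the shape of nodeGraph and goes beyond what was asked here.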