How to join two Spark RDDs in Scala


scala apache-spark


I have two Spark RDDs. The first contains a mapping between some indices and IDs (strings), and the second contains tuples of related indices.

import spark.implicits._  // required for toDF on an RDD of tuples

val ids = spark.sparkContext.parallelize(Array[(Int, String)](
  (1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"))).toDF("index", "idx")

val relationships = spark.sparkContext.parallelize(Array[(Int, Int)](
  (1, 3), (2, 3), (4, 5))).toDF("index1", "index2")

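I would like to join these RDDs somehow (join, merge, SQL, or whatever the best Spark practice is) so that I end up with the related IDs.

The result of combining the RDDs should be: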

("a", "c"), ("b", "c"), ("d", "e")

Do you know the best way to achieve this without loading either RDD into an in-memory map? In my scenario these RDDs may hold millions of records.

You can achieve this by creating two views from the DataFrames, as shown below:

relationships.createOrReplaceTempView("relationships")
ids.createOrReplaceTempView("ids")

Next, run the following SQL query. It performs two inner joins between the relationships and ids views, one per index column, to produce the desired result:

val result = spark.sql("""
  SELECT t.index1, id.idx
  FROM (SELECT id.idx AS index1, rel.index2
        FROM relationships rel
        INNER JOIN ids id ON rel.index1 = id.index) t
  INNER JOIN ids id ON id.index = t.index2
""")

result.show()

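Another approach is to use the DataFrame API directly, without creating views:

// The $"..." column syntax requires: import spark.implicits._
relationships.as("rel")
  .join(ids.as("ids"), $"ids.index" === $"rel.index1").as("temp")
  .join(ids.as("ids"), $"temp.index2" === $"ids.index")
  .select($"temp.idx".as("index1"), $"ids.idx".as("index2"))
  .show()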
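Since the question asks about RDDs, the same two-step join can also be sketched without DataFrames, using the pair-RDD `join`. This is only an illustration, assuming the data is kept as plain tuple RDDs (without the `toDF` calls). The object name `RddJoinSketch`, the helper `relatedIds`, and the local `SparkSession` setup are hypothetical and not part of the original answer.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object RddJoinSketch {
  // Hypothetical helper: resolves both indices of each relationship to their IDs.
  def relatedIds(ids: RDD[(Int, String)],
                 relationships: RDD[(Int, Int)]): RDD[(String, String)] = {
    relationships
      .join(ids)                                         // (index1, (index2, id1))
      .map { case (_, (index2, id1)) => (index2, id1) }  // re-key by index2
      .join(ids)                                         // (index2, (id1, id2))
      .values                                            // (id1, id2)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use its own configuration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("rdd-join-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val ids = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e")))
    val relationships = sc.parallelize(Seq((1, 3), (2, 3), (4, 5)))

    // Collected here only because the sample is tiny; prints (a,c), (b,c), (d,e).
    relatedIds(ids, relationships).collect().sorted.foreach(println)

    spark.stop()
  }
}
```

Both joins run as distributed shuffles, so neither RDD has to be collected into a driver-side map. Whether this is faster than the DataFrame version depends on the data, because the DataFrame API goes through Spark's query optimizer.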

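Thanks, I will give it a try and see whether it works. Also, regarding the original problem I am trying to solve: () I have updated my other post, in case you have any other ideas.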