How to join two RDDs in Scala

If I have two RDDs defined as:

Sample(Key1,EventDate,Value1) 
Sample2(Key1,ExecutionDate, Label1) 
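I want to join the two RDDs so that I can determine whether each Key1 exists in Sample2, and then split the complete result into two new RDDs: one containing the records whose Key1 exists in Sample2, and another containing all records whose Key1 does not exist in Sample2.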

FoundKey1(Key1, EventDate,Value1) 
NotFoundKey1(Key1, ExecutionDate,Label1)
Essentially, I want something like this SQL:

 SELECT Sample.Key1, Sample.EventDate, Sample.Value1
 FROM Sample
 WHERE NOT EXISTS (SELECT 1 FROM Sample2 WHERE Sample2.Key1 = Sample.Key1)
And for the other table:

 SELECT Sample.Key1, Sample.EventDate, Sample.Value1
 FROM Sample RIGHT JOIN Sample2
 ON (Sample.Key1 = Sample2.Key1);
Sample RDD values:

  Sample(1, 2016-01-05, 10)
  Sample(1, 2016-01-05, 10)
  Sample(2, 2016-01-05, 10)
  Sample(2, 2016-01-05, 10)
  Sample(3, 2016-01-05, 10)

  Sample2(1, 2016-01-05, A)
  Sample2(3, 2016-01-05, A)
  Sample2(5, 2016-01-05, B)
  Sample2(6, 2016-01-05, C)
  Sample2(7, 2016-01-05, C)
Before I forget: my RDDs are defined as RDD[Iterable[TestData]], where TestData is a class holding the Sample fields (Key1, EventDate, Value), and TestData2 holds (Key1, ExecutionDate, Label).
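To make the expected outcome concrete, here is a minimal sketch that applies the split to the sample values above using plain Scala collections instead of RDDs. The case class definitions, the String key type, and the names `ExpectedSplit`, `found`, and `notFound` are assumptions made for illustration; they are not taken from the original code.

```scala
// Hypothetical case classes mirroring the field layout described above.
// The key type (String) and value type (Int) are assumptions.
final case class TestData(Key1: String, EventDate: String, Value1: Int)
final case class TestData2(Key1: String, ExecutionDate: String, Label1: String)

object ExpectedSplit {
  // The sample values listed above, as ordinary in-memory sequences.
  val sample: Seq[TestData] = Seq(
    TestData("1", "2016-01-05", 10),
    TestData("1", "2016-01-05", 10),
    TestData("2", "2016-01-05", 10),
    TestData("2", "2016-01-05", 10),
    TestData("3", "2016-01-05", 10)
  )

  val sample2: Seq[TestData2] = Seq(
    TestData2("1", "2016-01-05", "A"),
    TestData2("3", "2016-01-05", "A"),
    TestData2("5", "2016-01-05", "B"),
    TestData2("6", "2016-01-05", "C"),
    TestData2("7", "2016-01-05", "C")
  )

  // Every key that occurs in Sample2.
  val keysInSample2: Set[String] = sample2.map(_.Key1).toSet

  // partition returns (rows satisfying the predicate, rows that do not):
  // found    holds the Sample rows with keys 1, 1, 3
  // notFound holds the Sample rows with keys 2, 2
  val (found, notFound) = sample.partition(row => keysInSample2.contains(row.Key1))
}
```

Under these assumptions, keys 5, 6, and 7 appear only in Sample2 and therefore show up in neither result, which matches the NOT EXISTS query above.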

Here is what I have tried so far:

  val grpSample = sample.groupBy(_.Key1).map(_._2)
  val grpSample2 = sample2.groupBy(_.Key1).map(_._2)
  val interSect = grpSample.intersection(grpSample2)

I ran this code to see whether the grouping worked, and received an error.

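Please show us what you have tried so far…

It would be better to convert them to DataFrames (i.e. Spark SQL) and then query them according to your condition.

@Akashi I'm a bit new to this. When you say convert to a DataFrame, how would I do that?

How do I convert the following so that I can access the original (Key1, EventDate, Value)?
result1: org.apache.spark.rdd.RDD[Iterable[TestData]] = MapPartitionsRDD[47] at map at :58
tempresult: org.apache.spark.rdd.RDD[(String, (Option[Iterable[TestData]], Option[Iterable[TestData2]]))] = MapPartitionsRDD[50] at fullOuterJoin at :57
result2: org.apache.spark.rdd.RDD[Option[Iterable[TestData]]] = MapPartitionsRDD[52] at map at :59

Try result2 = tempresult filter (_._2._2.isEmpty) map (_._2._1.get)

Did you find the solution useful?
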
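The DataFrame route suggested in the comments could look roughly like the sketch below. It assumes Spark 2.x or later, flat RDD[TestData] and RDD[TestData2] inputs (an RDD[Iterable[TestData]] would first need flattening, for example with flatMap(identity)), and hypothetical case classes and names (`DataFrameSplit`, `splitByKey`). The "left_semi" and "left_anti" join types stand in for the EXISTS / NOT EXISTS conditions from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical case classes; field names follow the question, types are assumptions.
final case class TestData(Key1: String, EventDate: String, Value1: Int)
final case class TestData2(Key1: String, ExecutionDate: String, Label1: String)

object DataFrameSplit {
  // Returns (found, notFound): rows of sampleDF whose Key1 does / does not occur in sample2DF.
  def splitByKey(sampleDF: DataFrame, sample2DF: DataFrame): (DataFrame, DataFrame) = {
    // left_semi keeps each left row at most once if any match exists on the right.
    val found = sampleDF.join(sample2DF, Seq("Key1"), "left_semi")
    // left_anti keeps the left rows that have no match on the right.
    val notFound = sampleDF.join(sample2DF, Seq("Key1"), "left_anti")
    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("df-split").getOrCreate()
    import spark.implicits._ // enables .toDF() on RDDs of case classes

    val sample = spark.sparkContext.parallelize(Seq(
      TestData("1", "2016-01-05", 10),
      TestData("2", "2016-01-05", 10),
      TestData("3", "2016-01-05", 10)
    ))
    val sample2 = spark.sparkContext.parallelize(Seq(
      TestData2("1", "2016-01-05", "A"),
      TestData2("3", "2016-01-05", "A"),
      TestData2("5", "2016-01-05", "B")
    ))

    val (found, notFound) = splitByKey(sample.toDF(), sample2.toDF())
    found.show()
    notFound.show()

    spark.stop()
  }
}
```

Both results keep only the columns of the first DataFrame, so the original (Key1, EventDate, Value1) fields stay directly accessible.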
val rdd1=sample.groupBy(_.Key1)
val rdd2=sample2.groupBy(_.Key1)

//to get data for which key exists in both rdd
val result1= rdd1 join rdd2 map (_._2)

//to get data for which key exists in first but not in second rdd
val tempresult= rdd1 fullOuterJoin rdd2
val result2= tempresult filter(_._2._2.isEmpty) map (_._2._1.get)
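As an alternative sketch that stays with RDDs but returns the original records rather than grouped Iterables, the same split can be expressed with keyBy, join, and subtractByKey. The case classes, the String key type, and the names `PairRddSplit` and `splitByKey` are assumptions for illustration, and the inputs are assumed to be flat RDD[TestData] and RDD[TestData2].

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical case classes; field names follow the question, types are assumptions.
final case class TestData(Key1: String, EventDate: String, Value1: Int)
final case class TestData2(Key1: String, ExecutionDate: String, Label1: String)

object PairRddSplit {
  // Returns (found, notFound): records of `sample` whose Key1 does / does not occur in `sample2`.
  def splitByKey(sample: RDD[TestData], sample2: RDD[TestData2]): (RDD[TestData], RDD[TestData]) = {
    // Pair each record with its key: RDD[(String, TestData)].
    val keyed = sample.keyBy(_.Key1)
    // Keep one entry per key so the join below cannot duplicate records of `sample`.
    val keysInSample2 = sample2.map(r => (r.Key1, ())).distinct()

    // Inner join keeps only records whose key occurs in sample2.
    val found = keyed.join(keysInSample2).map { case (_, (row, _)) => row }
    // subtractByKey drops every record whose key occurs in sample2.
    val notFound = keyed.subtractByKey(keysInSample2).values
    (found, notFound)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("pair-rdd-split"))

    val sample = sc.parallelize(Seq(
      TestData("1", "2016-01-05", 10),
      TestData("1", "2016-01-05", 10),
      TestData("2", "2016-01-05", 10),
      TestData("2", "2016-01-05", 10),
      TestData("3", "2016-01-05", 10)
    ))
    val sample2 = sc.parallelize(Seq(
      TestData2("1", "2016-01-05", "A"),
      TestData2("3", "2016-01-05", "A"),
      TestData2("5", "2016-01-05", "B"),
      TestData2("6", "2016-01-05", "C"),
      TestData2("7", "2016-01-05", "C")
    ))

    val (found, notFound) = splitByKey(sample, sample2)
    found.collect().foreach(println)
    notFound.collect().foreach(println)

    sc.stop()
  }
}
```

Compared with the fullOuterJoin version, subtractByKey avoids materializing Option values for keys that exist only in the second RDD, and working on ungrouped records avoids the groupBy step.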