In Scala, how do I combine rows from two DataFrames given a certain condition?
I have two DataFrames, called:
Table 1
+---------+------------+------+
| Animal  | Owner      |count1|
+---------+------------+------+
| Cat     | Bob        | 3    |
| Fish    | Jerry      | 2    |
| Dog     | Bob        | 2    |
| Turtle  | Joe        | 5    |
+---------+------------+------+
Table 2
+---------+------------+------+
| Animal  | Owner      |count2|
+---------+------------+------+
| Cat     | Bob        | 2    |
| Fish    | Jerry      | 1    |
| Dog     | Bob        | 3    |
| Snake   | Kim        | 6    |
+---------+------------+------+
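I am trying to combine these two DataFrames so that the new DataFrame below contains the rows
- that appear in either "Table 1" or "Table 2"
- where, for distinct rows found in both tables, the count value in "table2" is greater than the count value in "table1"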
+---------+------------+------+------+
| Animal  | Owner      |count1|count2|
+---------+------------+------+------+
| Dog     | Bob        | 2    | 3    |
| Turtle  | Joe        | 5    | null |
| Snake   | Kim        | null | 6    |
+---------+------------+------+------+
Rows that appear in "table1" but not in "table2" (or in "table2" but not in "table1") can have "null" for the missing count value.
In Spark, try a full join followed by a filter.
scala> val t1 = Seq(("Cat", "Bob", 3), ("Fish", "Jerry", 2), ("Dog", "Bob", 2), ("Turtle", "Joe", 5)).toDF("Animal", "Owner", "count1")
scala> val t2 = Seq(("Cat", "Bob", 2), ("Fish", "Jerry", 1), ("Dog", "Bob", 3), ("Snake", "Kim", 6)).toDF("Animal", "Owner", "count2")
Apply a full join to the DataFrames t1 (Table 1) and t2 (Table 2), and keep the rows in which either of the two count columns is null:
scala> t2.join(t1,Seq("Animal","Owner"),"full").filter(col("count2")>col("count1") || col("count2").isNull || col("count1").isNull).show
+------+-----+------+------+
|Animal|Owner|count2|count1|
+------+-----+------+------+
|   Dog|  Bob|     3|     2|
| Snake|  Kim|     6|  null|
|Turtle|  Joe|  null|     5|
+------+-----+------+------+
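The explicit `isNull` checks matter because, under Spark SQL's null semantics, `count2 > count1` evaluates to null when either side is null, and `filter` drops rows whose predicate is null. The same join and filter can be sketched as a standalone program for use outside the shell. The object name `CombineTables`, the helper `combine`, and the local `SparkSession` settings are assumptions made for illustration; they are not part of the answer above.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object CombineTables {
  // Hypothetical helper: full outer join on the key columns, then keep a row
  // when count2 > count1, or when it exists in only one table (a count is null).
  def combine(t1: DataFrame, t2: DataFrame): DataFrame =
    t1.join(t2, Seq("Animal", "Owner"), "full")
      .filter(col("count2") > col("count1") || col("count1").isNull || col("count2").isNull)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation; adjust for a real cluster.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("combine-tables")
      .getOrCreate()
    import spark.implicits._

    val t1 = Seq(("Cat", "Bob", 3), ("Fish", "Jerry", 2), ("Dog", "Bob", 2), ("Turtle", "Joe", 5))
      .toDF("Animal", "Owner", "count1")
    val t2 = Seq(("Cat", "Bob", 2), ("Fish", "Jerry", 1), ("Dog", "Bob", 3), ("Snake", "Kim", 6))
      .toDF("Animal", "Owner", "count2")

    // Row order after a full join is not guaranteed, so sort before printing.
    combine(t1, t2).orderBy("Animal").show()

    spark.stop()
  }
}
```

Joining as `t1.join(t2, ...)` places `count1` before `count2`, which matches the column order requested in the question. With the sample data, the rows kept are Dog, Snake, and Turtle.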