Select values from the first table that are not in the second table in Spark, using Scala or Spark SQL
I have a sample Hive/Spark table as follows:

row_key  data_as_of_date  key   value
A        20210121         key1  value1
A        20210121         key2  value2
A        20210121         key3  value3
B        20210121         key1  value1
B        20210121         key2  value1
B        20210121         key3  value2
B        20210121         key4  value3
C        20210121         key1  value2
You can simply perform a left anti join on the two DataFrames to get the expected output:
// Requires an active SparkSession; import its implicits so Seq(...).toDF works.
import spark.implicits._

val df = Seq(
  ("A","20210121","key1","value1"),
  ("A","20210121","key2","value2"),
  ("A","20210121","key3","value3"),
  ("B","20210121","key1","value1"),
  ("B","20210121","key2","value1"),
  ("B","20210121","key3","value3"),
  ("B","20210121","key4","value3"),
  ("C","20210121","key1","value2")
).toDF("row_key","data_as_of_date","key","value")

val df1 = Seq(
  ("A","20210121","key1","value1"),
  ("A","20210121","key2","value2"),
  ("B","20210121","key1","value1"),
  ("B","20210121","key4","value3"),
  ("C","20210121","key1","value2")
).toDF("row_key","data_as_of_date","key","value")

// "leftanti" keeps only the rows of df whose join keys have no match in df1
val outputdf = df.join(df1, Seq("row_key","data_as_of_date","key"), "leftanti")

outputdf.show()  // display(outputdf) is the Databricks-notebook equivalent
You will see output like the following (the three rows of df whose (row_key, data_as_of_date, key) combination does not appear in df1; row order may vary):

row_key  data_as_of_date  key   value
A        20210121         key3  value3
B        20210121         key2  value1
B        20210121         key3  value3
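Since the question also asks for a Spark SQL option: a sketch of the equivalent query, assuming the same df and df1 as above and an active SparkSession. It registers the DataFrames as temporary views and uses Spark SQL's LEFT ANTI JOIN, which matches the "leftanti" join type of the DataFrame API (a NOT EXISTS subquery would work as well):

```scala
// Register the DataFrames as temporary views so they can be queried with SQL
df.createOrReplaceTempView("first_table")
df1.createOrReplaceTempView("second_table")

// LEFT ANTI JOIN returns rows from first_table with no match in second_table
// on the listed join columns (value is deliberately not part of the key)
val sqlOutput = spark.sql("""
  SELECT t1.*
  FROM first_table t1
  LEFT ANTI JOIN second_table t2
    ON  t1.row_key         = t2.row_key
    AND t1.data_as_of_date = t2.data_as_of_date
    AND t1.key             = t2.key
""")

sqlOutput.show()
```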