PySpark DataFrame: subtract/diff two PySpark DataFrames based on all columns


I have two PySpark DataFrames, as shown below:

df1

df2

I want to find the rows that exist in df2 but not in df1, comparing all column values. So df2 - df1 should produce df_result as shown below:

df_result

id     city      country       region    continent
3      Paris      France       EU         EU
5      London     UK           EU         EU

How can I achieve this in PySpark? Thanks in advance.

You can use a left anti join:

df2.join(df1, on = ["id", "city", "country"], how = "left_anti").show()

+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+
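If all columns have non-null values, you can join on every column (null join keys never compare equal, so rows containing nulls would not be matched):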

df2.join(df1, on = df2.schema.names, how = "left_anti").show()
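
The non-null condition matters because a join compares keys with ordinary SQL equality, under which NULL is never equal to NULL. The sketch below models that behaviour in plain Python rather than in Spark, with None standing in for NULL. The helper names sql_equal and left_anti and the sample rows are hypothetical, chosen only for illustration.

```python
# Plain-Python model of a left anti join under SQL equality, where NULL
# (None here) never compares equal to anything, including another NULL.

def sql_equal(a, b):
    """SQL-style '=': a comparison involving NULL is never true."""
    return a is not None and b is not None and a == b

def left_anti(left_rows, right_rows):
    """Keep each left row that matches no right row on every column."""
    return [
        row for row in left_rows
        if not any(
            all(sql_equal(x, y) for x, y in zip(row, other))
            for other in right_rows
        )
    ]

# Hypothetical rows: (id, city, country); the Oslo row has a NULL country.
left = [(1, "chicago", "USA"), (6, "Oslo", None), (3, "Paris", "France")]
right = [(1, "chicago", "USA"), (6, "Oslo", None)]

result = left_anti(left, right)
print(result)
```

The Oslo row is present on both sides, yet it survives the anti join because its NULL country never matches. In PySpark, a null-safe comparison such as Column.eqNullSafe in the join condition is one way to avoid this.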

Another simple solution is to use the exceptAll() function. The documentation says:

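Return a new SparkDataFrame containing rows in this SparkDataFrame but not in another SparkDataFrame while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL. Also as standard in SQL, this function resolves columns by position (not by name).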

Creating the DataFrames here:
df_a

+---+-------+---------+------+---------+
|id |city   |country  |region|continent|
+---+-------+---------+------+---------+
|1  |chicago|USA      |NA    |NA       |
|2  |houston|USA      |NA    |NA       |
|3  |Sydney |Australia|AU    |AU       |
|4  |London |UK       |EU    |EU       |
+---+-------+---------+------+---------+
df_b

+---+-------+-------+------+---------+
|id |city   |country|region|continent|
+---+-------+-------+------+---------+
|1  |chicago|USA    |NA    |NA       |
|2  |houston|USA    |NA    |NA       |
|3  |Paris  |France |EU    |EU       |
|5  |London |UK     |EU    |EU       |
+---+-------+-------+------+---------+
Final output
df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id|  city|country|region|continent|
+---+------+-------+------+---------+
|  3| Paris| France|    EU|       EU|
|  5|London|     UK|    EU|       EU|
+---+------+-------+------+---------+
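
The duplicate-preserving behaviour described in the documentation quote can be modelled in plain Python as a multiset difference. This is a sketch of the semantics only, not Spark code: the rows are copied from df_a and df_b above, while the helper names except_all and except_distinct and the duplicated Paris row are assumptions added for illustration.

```python
from collections import Counter

# Rows copied from df_a and df_b: (id, city, country, region, continent).
df_a_rows = [
    (1, "chicago", "USA", "NA", "NA"),
    (2, "houston", "USA", "NA", "NA"),
    (3, "Sydney", "Australia", "AU", "AU"),
    (4, "London", "UK", "EU", "EU"),
]
df_b_rows = [
    (1, "chicago", "USA", "NA", "NA"),
    (2, "houston", "USA", "NA", "NA"),
    (3, "Paris", "France", "EU", "EU"),
    (5, "London", "UK", "EU", "EU"),
]

def except_all(left_rows, right_rows):
    """Multiset difference: each right row cancels at most one equal left row."""
    remaining = Counter(left_rows) - Counter(right_rows)  # keeps positive counts only
    return list(remaining.elements())

def except_distinct(left_rows, right_rows):
    """Set difference: distinct left rows that never occur on the right."""
    right_set = set(right_rows)
    return [row for row in dict.fromkeys(left_rows) if row not in right_set]

diff = except_all(df_b_rows, df_a_rows)
print(diff)

# Hypothetical variation: the Paris row occurs twice in df_b.
paris = (3, "Paris", "France", "EU", "EU")
dup_all = except_all(df_b_rows + [paris], df_a_rows)
dup_distinct = except_distinct(df_b_rows + [paris], df_a_rows)
print(len(dup_all), len(dup_distinct))
```

With the original rows, both helpers return the Paris and London (id 5) rows, matching the output shown above. The two differ only when df_b contains duplicates: the EXCEPT ALL model keeps both Paris copies, whereas the EXCEPT DISTINCT model (the semantics of PySpark's DataFrame.subtract) keeps one.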