PySpark: subtract/difference of DataFrames based on all columns
I have two PySpark DataFrames, df1 and df2. I want to find the rows that are present in df2 but not in df1, comparing on all column values. So df2 - df1 should produce df_result as follows:
df_result
id city country region continent
3 Paris France EU EU
5 London UK EU EU
How can I achieve this in PySpark? Thanks in advance.

You can use a left anti join:
df2.join(df1, on = ["id", "city", "country"], how = "left_anti").show()
+---+------+-------+------+---------+
| id| city|country|region|continent|
+---+------+-------+------+---------+
| 3| Paris| France| EU| EU|
| 5|London| UK| EU| EU|
+---+------+-------+------+---------+
If all columns contain only non-null values, you can join on every column:
df2.join(df1, on = df2.schema.names, how = "left_anti").show()
Another simple solution is to use the exceptAll() function. The documentation says:
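Return a new SparkDataFrame containing rows in this SparkDataFrame but not in another SparkDataFrame while preserving the duplicates. This is equivalent to EXCEPT ALL in SQL. Also as standard in SQL, this function resolves columns by position (not by name).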
Creating the DataFrames here:
df_a
+---+-------+---------+------+---------+
|id |city |country |region|continent|
+---+-------+---------+------+---------+
|1 |chicago|USA |NA |NA |
|2 |houston|USA |NA |NA |
|3 |Sydney |Australia|AU |AU |
|4 |London |UK |EU |EU |
+---+-------+---------+------+---------+
df_b
+---+-------+-------+------+---------+
|id |city |country|region|continent|
+---+-------+-------+------+---------+
|1 |chicago|USA |NA |NA |
|2 |houston|USA |NA |NA |
|3 |Paris |France |EU |EU |
|5 |London |UK |EU |EU |
+---+-------+-------+------+---------+
Final output
df_final = df_b.exceptAll(df_a)
df_final.show()
+---+------+-------+------+---------+
| id| city|country|region|continent|
+---+------+-------+------+---------+
| 3| Paris| France| EU| EU|
| 5|London| UK| EU| EU|
+---+------+-------+------+---------+