Dataframe: drop rows whose value in one column is present in another column, using a left anti join


df1:

+--------------------+--------------+----------------------+
|            event_id|     action_id|cancellation_action_id|
+--------------------+--------------+----------------------+
|a                   |actionIdUnique|                  null|
|c                   |   ActionId002|                  null|
+--------------------+--------------+----------------------+

df2:

+--------------------+--------------+----------------------+
|            event_id|     action_id|cancellation_action_id|
+--------------------+--------------+----------------------+
|a                   |actionIdUnique|                  null|
|b                   |   ActionId004|           ActionId002|
|c                   |   ActionId002|                  null|
+--------------------+--------------+----------------------+

df1 is basically df2 where cancellation_action_id is null.
My goal is to keep the rows of df1 whose action_id value is not present in cancellation_action_id.

Desired output:

+--------------------+--------------+----------------------+
|            event_id|     action_id|cancellation_action_id|
+--------------------+--------------+----------------------+
|a                   |actionIdUnique|                  null|
+--------------------+--------------+----------------------+
Event id c is removed because its action_id (ActionId002) is equal to the cancellation_action_id of event id b. I think there are two ways to do this: with a join or with window functions.
I tried a left anti join, but I don't understand why the joined dataframe is not equal to the desired one.

My result is:

+--------------------+--------------+----------------------+
|            event_id|     action_id|cancellation_action_id|
+--------------------+--------------+----------------------+
|a                   |actionIdUnique|                  null|
|c                   |   ActionId002|                  null|
+--------------------+--------------+----------------------+

The join I tried:

val df3 = df1.join(df2, df1("action_id") === df2("cancellation_action_id"), "leftanti")
I don't understand why the last row is not removed.
Both dataframes come from the same dataframe, so they have the same schema.

This is a known issue in Spark. We avoid it by doing:

df = df.toDF(*df.columns)

You need to do this for each frame that will be joined.

Here is an example in Python (it assumes from pyspark.sql import functions as F), but I think the same can be done in Scala:


The wrong result is caused by the incorrect comparison of a column against null values (more details after the example).

In [90]: df = df.toDF(*df.columns)
In [91]: df.show()
+--------+--------------+----------------------+
|event_id|     action_id|cancellation_action_id|
+--------+--------------+----------------------+
|       a|ActionIdUnique|                  null|
|       b|   ActionId004|           ActionId002|
|       c|   ActionId002|                  null|
+--------+--------------+----------------------+

In [92]: df1 = df.filter(F.col('cancellation_action_id').isNull())
In [93]: df1 = df1.toDF(*df1.columns)
In [94]: df1.show()
+--------+--------------+----------------------+
|event_id|     action_id|cancellation_action_id|
+--------+--------------+----------------------+
|       a|ActionIdUnique|                  null|
|       c|   ActionId002|                  null|
+--------+--------------+----------------------+

In [95]: df_res = df1.join(df, df1['action_id'] == df['cancellation_action_id'], 'leftanti')
In [96]: df_res.show()
+--------+--------------+----------------------+
|event_id|     action_id|cancellation_action_id|
+--------+--------------+----------------------+
|       a|ActionIdUnique|                  null|
+--------+--------------+----------------------+
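For completeness, here is a rough Scala equivalent of the same fix. This is a minimal sketch, not the answer's own code: it assumes a local SparkSession, rebuilds the sample data inline, and uses toDF(df.columns: _*) to play the role of Python's df.toDF(*df.columns).

import org.apache.spark.sql.SparkSession

object LeftAntiExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("left-anti").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a", "actionIdUnique", Option.empty[String]),
      ("b", "ActionId004", Some("ActionId002")),
      ("c", "ActionId002", Option.empty[String])
    ).toDF("event_id", "action_id", "cancellation_action_id")

    // Re-alias the columns to break the shared lineage before the self-join,
    // mirroring df = df.toDF(*df.columns) from the Python session above.
    val df2 = df.toDF(df.columns: _*)
    val df1 = df2.filter($"cancellation_action_id".isNull).toDF(df.columns: _*)

    // Left anti join: keep only df1 rows whose action_id matches no
    // cancellation_action_id in df2; row c should be dropped.
    df1.join(df2, df1("action_id") === df2("cancellation_action_id"), "leftanti").show()

    spark.stop()
  }
}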
More details:

Comparison with null values in Spark is not obvious:

df2.filter(F.col('cancellation_action_id') == 'ActionId002').show()

+-----------+--------+----------------------+
|  action_id|event_id|cancellation_action_id|
+-----------+--------+----------------------+
|ActionId004|       b|           ActionId002|
+-----------+--------+----------------------+
df2.filter(~(F.col('cancellation_action_id') == 'ActionId002')).show()

+---------+--------+----------------------+
|action_id|event_id|cancellation_action_id|
+---------+--------+----------------------+
+---------+--------+----------------------+
The negated comparison returns an empty result because comparing null with == yields null, and filter drops nulls. But eqNullSafe is correct:

df2.filter(~F.col('cancellation_action_id').eqNullSafe('ActionId002')).show()

+--------------+--------+----------------------+
|     action_id|event_id|cancellation_action_id|
+--------------+--------+----------------------+
|actionIdUnique|       a|                  null|
|   ActionId002|       c|                  null|
+--------------+--------+----------------------+
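The question also mentions window functions as a second route. Below is a minimal sketch of that idea (not from the answer), reusing the df built in the Scala sketch above. It uses an unpartitioned window, which pulls all rows into a single partition and so only suits small data, and the cancel_ids column name is just for illustration.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{array_contains, col, collect_set}

// Collect every cancellation_action_id in the frame (collect_set skips nulls).
val w = Window.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

val result = df
  .withColumn("cancel_ids", collect_set(col("cancellation_action_id")).over(w))
  // Keep the df1-style rows (null cancellation_action_id) whose action_id
  // does not appear among the collected cancellation ids.
  .filter(col("cancellation_action_id").isNull &&
    !array_contains(col("cancel_ids"), col("action_id")))
  .drop("cancel_ids")

result.show()  // expected: only event_id "a"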