DataFrame: remove rows whose column value appears in another column, using a left anti join
df1 is basically df2 restricted to the rows where cancellation_action_id is null.
My goal is to keep the rows in df1 whose action_id value does not appear in cancellation_action_id. For reference, df2 contains:
+--------------------+--------------+----------------------+
| event_id| action_id|cancellation_action_id|
+--------------------+--------------+----------------------+
|a |actionIdUnique| null|
|b | ActionId004| ActionId002|
|c | ActionId002| null|
+--------------------+--------------+----------------------+
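Event id c is removed because its action_id (ActionId002) equals the cancellation_action_id of event_id b.
I figure there are two ways to do this: a join or a window function. I tried a left anti join, but I don't understand why the joined dataframe does not equal the expected dataframe: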
+--------------------+--------------+----------------------+
|event_id |action_id |cancellation_action_id|
+--------------------+--------------+----------------------+
|a |actionIdUnique|null |
+--------------------+--------------+----------------------+
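To make the expected result concrete, here is a minimal plain-Python sketch of what a left anti join on action_id = cancellation_action_id should return for this data. It is a model, not Spark code: rows are dicts, None stands in for SQL null, and the helper name left_anti_join is hypothetical.

```python
# Plain-Python model of a left anti join (illustration only, not the Spark API).
# None stands in for SQL null; rows are dicts keyed by column name.

df2 = [
    {"event_id": "a", "action_id": "actionIdUnique", "cancellation_action_id": None},
    {"event_id": "b", "action_id": "ActionId004", "cancellation_action_id": "ActionId002"},
    {"event_id": "c", "action_id": "ActionId002", "cancellation_action_id": None},
]

# df1 is df2 restricted to the rows whose cancellation_action_id is null.
df1 = [row for row in df2 if row["cancellation_action_id"] is None]


def left_anti_join(left, right, left_key, right_key):
    """Keep the left rows that have no equal, non-null match on the right."""
    # A null on either side never satisfies an equality join condition.
    right_values = {r[right_key] for r in right if r[right_key] is not None}
    return [
        row for row in left
        if row[left_key] is None or row[left_key] not in right_values
    ]


result = left_anti_join(df1, df2, "action_id", "cancellation_action_id")
print([row["event_id"] for row in result])  # prints ['a']
```

Row c is dropped because its action_id matches the cancellation_action_id of row b, which is the behaviour the question expects from the join.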
The join I used:
df3 = df1.join(df2, df1("action_id") === df2("cancellation_action_id"), "leftanti")
My result is:
+--------------------+--------------+----------------------+
|event_id |action_id |cancellation_action_id|
+--------------------+--------------+----------------------+
|a |actionIdUnique|null |
|c |ActionId002 |null |
+--------------------+--------------+----------------------+
I don't understand why the last row was not removed.
Both dataframes derive from the same dataframe, so they have the same schema. This is a known issue in Spark. We avoid it by doing the following:
df = df.toDF(*df.columns)
You need to do this for each frame you want to join.
Here is an example in Python (F is pyspark.sql.functions), but I think the same can be done in Scala:
In [90]: df = df.toDF(*df.columns)
In [91]: df.show()
+--------+--------------+----------------------+
|event_id| action_id|cancellation_action_id|
+--------+--------------+----------------------+
| a|ActionIdUnique| null|
| b| ActionId004| ActionId002|
| c| ActionId002| null|
+--------+--------------+----------------------+
In [92]: df1 = df.filter(F.col('cancellation_action_id').isNull())
In [93]: df1 = df1.toDF(*df1.columns)
In [94]: df1.show()
+--------+--------------+----------------------+
|event_id| action_id|cancellation_action_id|
+--------+--------------+----------------------+
| a|ActionIdUnique| null|
| c| ActionId002| null|
+--------+--------------+----------------------+
In [95]: df_res = df1.join(df, df1['action_id'] == df['cancellation_action_id'], 'leftanti')
In [96]: df_res.show()
+--------+--------------+----------------------+
|event_id| action_id|cancellation_action_id|
+--------+--------------+----------------------+
| a|ActionIdUnique| null|
+--------+--------------+----------------------+
This is caused by an incorrect comparison of a column with null values.
More details:
Comparison with null values in Spark is not obvious:
+--------------+--------+----------------------+
| action_id|event_id|cancellation_action_id|
+--------------+--------+----------------------+
|actionIdUnique| a| null|
+--------------+--------+----------------------+
A plain equality filter returns only the matching row:
df2.filter(F.col('cancellation_action_id') == 'ActionId002').show()
+-----------+--------+----------------------+
| action_id|event_id|cancellation_action_id|
+-----------+--------+----------------------+
|ActionId004| b| ActionId002|
+-----------+--------+----------------------+
But negating that comparison returns no rows, because a comparison with null evaluates to null rather than false, and filter drops those rows:
df2.filter(~(F.col('cancellation_action_id') == 'ActionId002')).show()
+---------+--------+----------------------+
|action_id|event_id|cancellation_action_id|
+---------+--------+----------------------+
+---------+--------+----------------------+
Correct:
df2.filter(~F.col('cancellation_action_id').eqNullSafe('ActionId002')).show()
+--------------+--------+----------------------+
| action_id|event_id|cancellation_action_id|
+--------------+--------+----------------------+
|actionIdUnique| a| null|
| ActionId002| c| null|
+--------------+--------+----------------------+
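The difference between the two negated filters comes from SQL three-valued logic. The following is a minimal plain-Python sketch of that logic, not Spark code: None stands in for null, and the helper names (sql_eq, sql_eq_null_safe, sql_not, sql_filter) are hypothetical. In Spark, == on a column behaves like SQL `=`, while eqNullSafe corresponds to `<=>`.

```python
# Plain-Python model of SQL three-valued logic (illustration only, not Spark code).
# None stands in for SQL null.

def sql_eq(a, b):
    """SQL '=': the result is unknown (None) if either operand is null."""
    if a is None or b is None:
        return None
    return a == b


def sql_eq_null_safe(a, b):
    """SQL '<=>': always True or False; two nulls compare equal."""
    if a is None or b is None:
        return a is None and b is None
    return a == b


def sql_not(value):
    """SQL NOT: negating unknown still gives unknown."""
    return None if value is None else not value


def sql_filter(rows, predicate):
    """Keep only the rows whose predicate is exactly True, as a WHERE clause does."""
    return [row for row in rows if predicate(row) is True]


# (event_id, cancellation_action_id) pairs taken from df2 above.
rows = [("a", None), ("b", "ActionId002"), ("c", None)]

plain = sql_filter(rows, lambda r: sql_not(sql_eq(r[1], "ActionId002")))
null_safe = sql_filter(rows, lambda r: sql_not(sql_eq_null_safe(r[1], "ActionId002")))

print([event for event, _ in plain])      # prints []
print([event for event, _ in null_safe])  # prints ['a', 'c']
```

The plain comparison loses rows a and c because NOT applied to an unknown result is still unknown, whereas the null-safe comparison turns the null case into a definite False before negation.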