Filter on a joined DataFrame does not work in PySpark

python apache-spark pyspark

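I have a DataFrame with the following three columns:

student_id
name
timestamp

Each student_id has multiple rows, with different names and the timestamp at which the record was actually updated. I want to get two different DataFrames:

unique_data (one row per student_id, the one with the latest timestamp for that student_id)
duplicate_data (all rows of the input DataFrame except the unique_data rows above)

I have the following code to generate the two DataFrames: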

input_frame.show()
+----------+----------+---------+
|student_id|name      |timestamp|
+----------+----------+---------+
|        s1|testuser  |       t1|
|        s1|sampleuser|       t2|
|        s2|test123   |       t1|
|        s2|sample123 |       t2|
+----------+----------+---------+

# Assuming t2 > t1

unique_data = input_frame.sort(sf.desc("timestamp")).drop_duplicates(["student_id"])
unique_data.show()
+----------+----------+---------+
|student_id|name      |timestamp|
+----------+----------+---------+
|        s1|sampleuser|       t2|
|        s2|sample123 |       t2|
+----------+----------+---------+

input_frame = input_frame.alias('input_frame')
unique_data = unique_data.alias('unique_data')

joined_data = input_frame.join(unique_data, input_frame["student_id"] == unique_data["student_id"], how="left")
joined_data.show()
+----------+----------+---------+----------+----------+---------+
|student_id|name      |timestamp|student_id|name      |timestamp|
+----------+----------+---------+----------+----------+---------+
|        s1|testuser  |       t1|        s1|sampleuser|       t2|
|        s1|sampleuser|       t2|        s1|sampleuser|       t2|
|        s2|test123   |       t1|        s2|sample123 |       t2|
|        s2|sample123 |       t2|        s2|sample123 |       t2|
+----------+----------+---------+----------+----------+---------+



duplicate_data = joined_data.filter(input_frame["timestamp"] != unique_data["timestamp"]).select("input_frame.*")
duplicate_data.show()
+----------+----+---------+
|student_id|name|timestamp|
+----------+----+---------+
+----------+----+---------+

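unique_data["timestamp"] refers to a whole column, so Spark does not know which row you are talking about. You can do the following:

duplicate_data = joined_data.filter(joined_data.timestamp != unique_data.collect()[0]['timestamp'])

This keeps the rows where joined_data.timestamp is not equal to the timestamp of the first row of unique_data (row 0). Alternatively, you can iterate over each row of unique_data and check whether the values are equal.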

We should use the aliases in the filter condition, because both frames have columns with the same names.

from pyspark.sql import functions as sf

input_frame = input_frame.alias('input_frame')
unique_data = unique_data.alias('unique_data')

# joined_data must be built from the aliased frames (same join as in the question)
duplicate_data = joined_data.filter(sf.col("input_frame.timestamp") != sf.col("unique_data.timestamp")).select("input_frame.*")
duplicate_data.show()
+----------+----------+---------+
|student_id|name      |timestamp|
+----------+----------+---------+
|        s1|testuser  |       t1|
|        s2|test123   |       t1|
+----------+----------+---------+