PySpark dropDuplicates, but choose which row to keep when a column is null
I have a table like this:
+---------+-------------+--------------+-----------+--------+--------------+--------------+
| cust_num|valid_from_dt|valid_until_dt|cust_row_id| cust_id|insert_load_dt|update_load_dt|
+---------+-------------+--------------+-----------+--------+--------------+--------------+
|950379405|   2018-08-24|    2018-08-24|   06885247|06885247|    2018-08-24|    2018-08-25|
|950379405|   2018-08-25|    2018-08-28|   06885247|06885247|    2018-08-25|    2018-08-29|
|950379405|   2018-08-29|    2019-12-16|   27344328|06885247|    2018-08-29|    2019-12-17|<- pair 1
|950379405|   2018-08-29|    2019-12-16|   27344328|06885247|    2018-08-29|              |<- pair 1
|950379405|   2019-12-17|    2019-12-24|   91778710|06885247|    2019-12-17|              |<- pair 2
|950379405|   2019-12-17|    2019-12-24|   91778710|06885247|    2019-12-17|    2019-12-25|<- pair 2
|950379405|   2019-12-25|    2019-12-25|   08396180|06885247|    2019-12-25|    2019-12-26|<- pair 3
|950379405|   2019-12-25|    2019-12-25|   08396180|06885247|    2019-12-25|              |<- pair 3
+---------+-------------+--------------+-----------+--------+--------------+--------------+
But I want to keep more of the information. That is, I want to keep the rows where update_load_dt is not null.

Is it possible to modify the dropDuplicates() function so that I can choose which of the duplicate rows to keep? Or is there another (better) way?

You can use a window function, though it may be slow on big data:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

# Rank the rows in each duplicate group; asc_nulls_last sorts null
# update_load_dt values last, so row 1 has a value whenever one exists.
(df.withColumn("row_number",
               F.row_number().over(
                   Window.partitionBy(<cols>).orderBy(F.asc_nulls_last("update_load_dt"))))
   .filter("row_number = 1")
   .drop("row_number"))  # the drop is optional
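For reference, here is a minimal, self-contained sketch of this approach; the SparkSession setup and the sample rows (copied from pair 1 of the table above) are illustrative assumptions, and <cols> becomes the five key columns:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Two rows that differ only in update_load_dt (one null, one set),
# mirroring pair 1 in the question.
df = spark.createDataFrame(
    [("950379405", "2018-08-29", "2019-12-16", "27344328", "06885247", "2018-08-29", "2019-12-17"),
     ("950379405", "2018-08-29", "2019-12-16", "27344328", "06885247", "2018-08-29", None)],
    ["cust_num", "valid_from_dt", "valid_until_dt", "cust_row_id",
     "cust_id", "insert_load_dt", "update_load_dt"],
)

key_cols = ["cust_num", "valid_from_dt", "valid_until_dt", "cust_row_id", "cust_id"]
w = Window.partitionBy(*key_cols).orderBy(F.asc_nulls_last("update_load_dt"))

# Nulls sort last, so row_number 1 is the row with a non-null
# update_load_dt whenever the group contains one.
(df.withColumn("row_number", F.row_number().over(w))
   .filter("row_number = 1")
   .drop("row_number")
   .show())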
I would do it this way: F.max() does what you want and keeps the highest value (on a date column, max() keeps the latest date entry if there are several). I work with data of more than a billion rows, and this is not slow.

Let me know if this is useful:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

key_cols = ['cust_num', 'valid_from_dt', 'valid_until_dt', 'cust_row_id', 'cust_id']
w = Window.partitionBy(key_cols)

# max() ignores nulls, so every row in a duplicate group gets the non-null
# (latest) update_load_dt before dropDuplicates() keeps one row per group.
df.withColumn('update_load_dt', F.max('update_load_dt').over(w)).dropDuplicates(key_cols)
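Reusing the assumed df and key_cols from the sketch under the first answer, an end-to-end check of this variant would look like this (again an illustration, not part of the original answer):

import pyspark.sql.functions as F
from pyspark.sql.window import Window

w_max = Window.partitionBy(key_cols)  # no orderBy, so max() sees the whole group

(df.withColumn("update_load_dt", F.max("update_load_dt").over(w_max))
   .dropDuplicates(key_cols)
   .show())

One difference worth noting: the row_number() variant keeps one of the original rows untouched, while the max() variant first overwrites update_load_dt in every row of the group and then deduplicates; on data shaped like the question's pairs the visible result is the same.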