Get the row corresponding to the earliest timestamp in PySpark


I have a dataframe:

+--------------+-------------+-------------------+
|          ecid|creation_user| creation_timestamp|
+--------------+-------------+-------------------+
|ECID-195000300|     USER_ID1|2018-08-31 20:00:00|
|ECID-195000300|     USER_ID2|2016-08-31 20:00:00|
+--------------+-------------+-------------------+
I need the row with the earliest timestamp:

+--------------+-------------+-------------------+
|          ecid|creation_user| creation_timestamp|
+--------------+-------------+-------------------+
|ECID-195000300|     USER_ID2|2016-08-31 20:00:00|
+--------------+-------------+-------------------+
How can I achieve this in PySpark? I tried:
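Presumably the attempt was a plain aggregation along these lines (an assumed reconstruction; only its symptom is described below):

from pyspark.sql import functions as F

# A plain groupBy keeps only the grouping key and the aggregate,
# dropping creation_user and any other column.
df.groupBy("ecid").agg(F.min("creation_timestamp").alias("creation_timestamp")).show()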


However, I only got the ecid and timestamp fields back. I want all of the fields, not just those two.

I think you need a window function plus a filter. I can suggest the following untested solution:

import pyspark.sql.window as psw
import pyspark.sql.functions as psf

# window spanning all rows that share the same ecid
w = psw.Window.partitionBy("ecid")

# attach the per-ecid minimum timestamp, then keep the matching row(s)
df = (df.withColumn("min_tmp", psf.min("creation_timestamp").over(w))
        .filter(psf.col("min_tmp") == psf.col("creation_timestamp")))

The window function allows you to compute the min over each ecid and return it as a new column of the dataframe.
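As a small follow-up (not part of the original answer), the helper column can be dropped once the filter has run; with the sample data shown further down this yields:

# remove the helper column after filtering
df.drop("min_tmp").show()
# +--------------+-------------+-------------------+
# |          ecid|creation_user| creation_timestamp|
# +--------------+-------------+-------------------+
# |ECID-195000300|     USER_ID2|2016-08-31 20:00:00|
# +--------------+-------------+-------------------+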

Use the window row_number function, with partitionBy on ecid and orderBy on creation_timestamp.

Example:

from pyspark.sql import Window
from pyspark.sql.functions import row_number, col

# sample data
df = spark.createDataFrame(
    [("ECID-195000300", "USER_ID1", "2018-08-31 20:00:00"),
     ("ECID-195000300", "USER_ID2", "2016-08-31 20:00:00")],
    ["ecid", "creation_user", "creation_timestamp"])

# number the rows of each ecid from the earliest timestamp onwards
w = Window.partitionBy("ecid").orderBy("creation_timestamp")

# keep only the first (earliest) row per ecid
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn").show()
#+--------------+-------------+-------------------+
#|          ecid|creation_user| creation_timestamp|
#+--------------+-------------+-------------------+
#|ECID-195000300|     USER_ID2|2016-08-31 20:00:00|
#+--------------+-------------+-------------------+
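One caveat beyond the answer above: row_number assigns a unique number, so if two rows tie on the earliest creation_timestamp, one of them is dropped arbitrarily. Swapping in rank (a sketch using the same window w) keeps all tied rows:

from pyspark.sql.functions import rank

# rank() gives tied rows the same number, so every row sharing the
# earliest creation_timestamp survives the filter
df.withColumn("rn", rank().over(w)).filter(col("rn") == 1).drop("rn").show()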