Get the row corresponding to the latest timestamp in PySpark
I have a DataFrame:
+--------------+-----------------+-------------------+
| ecid| creation_user| creation_timestamp|
+--------------+-----------------+-------------------+
|ECID-195000300|USER_ID1 |2018-08-31 20:00:00|
|ECID-195000300|USER_ID2 |2016-08-31 20:00:00|
I need the single row with the earliest timestamp:
+--------------+-----------------+-------------------+
| ecid| creation_user| creation_timestamp|
+--------------+-----------------+-------------------+
|ECID-195000300|USER_ID2 |2016-08-31 20:00:00|
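How can I achieve this in PySpark?
I tried, but my attempt returned only the ecid and timestamp fields. I want all of the fields, not just those two.

I think you need a window function plus a filter. I can suggest the following untested solution: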
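import pyspark.sql.window as psw
import pyspark.sql.functions as psf

# One window per ecid; no ordering is needed to compute a minimum
w = psw.Window.partitionBy('ecid')
# Attach the per-ecid minimum timestamp, then keep the rows that match it
df = (df.withColumn("min_tmp", psf.min("creation_timestamp").over(w))
      .filter(psf.col("min_tmp") == psf.col("creation_timestamp"))
      )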
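The window function lets you return the min over each ecid as a new column of the DataFrame.

Use the row_number window function, partitioning by ecid and ordering by creation_timestamp.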
Example:
# sample data (assumes an existing SparkSession named spark)
df = spark.createDataFrame(
    [("ECID-195000300", "USER_ID1", "2018-08-31 20:00:00"),
     ("ECID-195000300", "USER_ID2", "2016-08-31 20:00:00")],
    ["ecid", "creation_user", "creation_timestamp"])

from pyspark.sql import Window
from pyspark.sql.functions import col, row_number

# Number the rows of each ecid from earliest to latest timestamp, then keep the first
w = Window.partitionBy('ecid').orderBy("creation_timestamp")
df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn").show()
#+--------------+-------------+-------------------+
#| ecid|creation_user| creation_timestamp|
#+--------------+-------------+-------------------+
#|ECID-195000300| USER_ID2|2016-08-31 20:00:00|
#+--------------+-------------+-------------------+