Duration between the current row and the next row using PySpark


trip

id,timestamp
1008,2003-11-03 15:00:31
1008,2003-11-03 15:02:38
1008,2003-11-03 15:03:04
1008,2003-11-03 15:18:00
1009,2003-11-03 22:00:00
1009,2003-11-03 22:02:53
1009,2003-11-03 22:03:44 
1009,2003-11-14 10:00:00
1009,2003-11-14 10:02:02
1009,2003-11-14 10:03:10
Using pandas:

import numpy as np
import pandas as pd

# difference to the next timestamp within the same id (/1000 assumes millisecond timestamps)
trip['time_diff'] = np.where(trip['id'] == trip['id'].shift(-1),
                             (trip['timestamp'].shift(-1) - trip['timestamp']) / 1000,
                             None)
trip['time_diff'] = pd.to_numeric(trip['time_diff'])
I tried to do the same in PySpark, but it didn't work. I've been programming with Spark for a week now and I still can't get the window to work.

from pyspark.sql.types import *
from pyspark.sql import Window
from pyspark.sql import functions as F

my_window = Window.partitionBy('id').orderBy('timestamp').rowsBetween(0, 1)

timeFmt = "yyyy-MM-dd HH:mm:ss"

time_diff = (F.unix_timestamp(trip.timestamp, format=timeFmt).cast("long")  - 
             F.unix_timestamp(trip.timestamp, format=timeFmt).over(my_window).cast("long")) 

trip = trip.withColumn('time_diff', time_diff)
I don't know whether this is even a valid approach!! If not, how do I translate this operation to PySpark?

The result should look like this:

id, timestamp, diff_time
1008, 2003-11-03 15:00:31, 127
1008, 2003-11-03 15:02:38, 26
1008, 2003-11-03 15:03:04, 896
1008, 2003-11-03 15:18:00, None
1009, 2003-11-03 22:00:00, 173
1009, 2003-11-03 22:02:53, 51
1009, 2003-11-03 22:03:44, 956776
1009, 2003-11-14 10:00:00, .....
1009, 2003-11-14 10:02:02, .....
1009, 2003-11-14 10:03:10, .....

You can use the lead function to calculate the time difference. Here is what you want:

val interdf = spark.sql("select id, timestamp, lead(timestamp) over (partition by id order by timestamp) as next_ts from data")
interdf.createOrReplaceTempView("interdf")
spark.sql("select id, timestamp, next_ts, unix_timestamp(next_ts) - unix_timestamp(timestamp) from interdf").show()
If you want to avoid Spark SQL, you can do it by importing the relevant functions:

import org.apache.spark.sql.functions.lead
import org.apache.spark.sql.expressions.Window
val window = Window.partitionBy("id").orderBy("timestamp")
The corresponding Python code:

from pyspark.sql import Window
from pyspark.sql.functions import abs, col, lead

window = Window.partitionBy("id").orderBy("timestamp")

# current timestamp minus the next timestamp within the same id, in seconds
diff = col("timestamp").cast("long") - lead("timestamp", 1).over(window).cast("long")
df = df.withColumn("diff", diff)
# take the absolute value so the duration is positive
df = df.withColumn('diff', abs(df.diff))
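As a quick end-to-end check, here is a minimal sketch that builds a few of the sample rows from the question and applies the same window; the DataFrame name df and the to_timestamp parsing are assumptions, not part of the original answer:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, lead, to_timestamp

spark = SparkSession.builder.getOrCreate()

# a few of the sample rows from the question, with the string timestamps parsed
df = spark.createDataFrame(
    [(1008, "2003-11-03 15:00:31"),
     (1008, "2003-11-03 15:02:38"),
     (1008, "2003-11-03 15:03:04"),
     (1008, "2003-11-03 15:18:00")],
    ["id", "timestamp"],
).withColumn("timestamp", to_timestamp("timestamp"))

window = Window.partitionBy("id").orderBy("timestamp")

# next timestamp minus current timestamp, in seconds; the last row of each id gets null
df = df.withColumn(
    "diff_time",
    lead("timestamp", 1).over(window).cast("long") - col("timestamp").cast("long"),
)
df.orderBy("id", "timestamp").show()

For the 1008 rows this should give 127, 26, 896 and null, matching the expected output above.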

Thanks for your answer, but I don't understand why the timestamps are not ordered even though orderBy is used. Also, I cheated a bit by adding abs() to get a positive difference, and I don't know whether that is the best approach.

@adilblanco Your edit was rejected by several mods, but I have edited my answer to include your Python code. Sorry, I completely missed that you were using Python. To answer your question: the timestamps are indeed ordered, but per id, because that is what the partitioning is based on. Also, my code has lead(timestamp) - timestamp, while your code has timestamp - lead(timestamp), which is why you get negative values.
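To illustrate that last point, a minimal sketch of the two orderings, reusing the same window definition; with the lead(timestamp) - timestamp order the values are already non-negative, so the abs() call is not needed:

from pyspark.sql import Window
from pyspark.sql.functions import col, lead

window = Window.partitionBy("id").orderBy("timestamp")

# answer's order: next row minus current row -> non-negative, no abs() needed
diff = lead("timestamp", 1).over(window).cast("long") - col("timestamp").cast("long")

# question's order: current row minus next row -> negative values, hence the abs()
# diff = col("timestamp").cast("long") - lead("timestamp", 1).over(window).cast("long")

df = df.withColumn("diff", diff)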