Grouping a PySpark DataFrame by datetime intervals

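I have a PySpark DataFrame that is sorted by "timestamp" and "ship" in ascending order.

I want to add a new column named "trip" to the DataFrame. A trip is defined as the records of one ship number that fall within 2 hours after that ship's recording starts in the DataFrame. If the ship number changes within those 2 hours, a new trip number should be assigned in the "trip" column.

The desired output looks like this: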

+----------------------+------+-------+
|        timestamp     | ship | trip  |
+----------------------+------+-------+
| 2018-08-01 06:01:00  |    1 |    1  | # start new ship number
| 2018-08-01 06:01:30  |    1 |    1  | # still within 2 hours of same ship number
| 2018-08-01 09:00:00  |    1 |    2  | # more than 2 hours of same ship number = new trip
| 2018-08-01 09:00:00  |    2 |    3  | # new ship number = new trip
| 2018-08-01 10:15:43  |    2 |    3  | # still within 2 hours of same ship number
| 2018-08-01 11:00:01  |    3 |    4  | # new ship number = new trip
| 2018-08-01 06:00:13  |    4 |    5  | # new ship number = new trip
| 2018-08-01 13:00:00  |    4 |    6  | # more than 2 hours of same ship number = new trip
| 2018-08-13 14:00:00  |    5 |    7  | # new ship number = new trip
| 2018-08-13 14:15:03  |    5 |    7  | # still within 2 hours of same ship number
| 2018-08-13 14:45:08  |    5 |    7  | # still within 2 hours of same ship number
| 2018-08-13 14:50:00  |    5 |    7  | # still within 2 hours of same ship number
+----------------------+------+-------+

In pandas, it would be done like this:

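dt_trip = 2  # trip duration per ship, in hours
# time elapsed since each ship's first record
total_time = df['timestamp'] - df.groupby('ship')['timestamp'].transform('min')
# index of the 2-hour bucket each record falls into
trips = total_time.dt.total_seconds().fillna(0) // (dt_trip * 3600)
# number each (ship, bucket) pair consecutively, starting at 1
df['trip'] = df.groupby(['ship', trips]).ngroup() + 1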

How can this be achieved in PySpark?

Use window functions with row_number() and collect_list(), then take a running sum over the resulting condition.

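from pyspark.sql import functions as F
from pyspark.sql.window import Window

# w1: rows of the same ship from 7199 seconds before the current row up to the current row
w1 = Window.partitionBy("ship").orderBy(F.unix_timestamp("timestamp")).rangeBetween(-7199, Window.currentRow)
# w2: rows of the same ship in time order, used to find each ship's first record
w2 = Window.partitionBy("ship").orderBy("timestamp")
# w3: global ordering over which the running sum is taken
w3 = Window.orderBy("ship", "timestamp")

# Flag a row with 1 when it starts a new trip, otherwise 0
new_trip = (
    F.when(F.row_number().over(w2) == 1, F.lit(1))                 # first record of a ship
    .when(F.size(F.collect_list("ship").over(w1)) == 1, F.lit(1))  # no earlier record inside w1
    .otherwise(F.lit(0))
)

# The running sum of the flags is the trip number
df.withColumn("trip", F.sum(new_trip).over(w3)).orderBy("ship", "timestamp").show()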

#+-------------------+----+----+
#|          timestamp|ship|trip|
#+-------------------+----+----+
#|2018-08-01 06:01:00|   1|   1|
#|2018-08-01 06:01:30|   1|   1|
#|2018-08-01 09:00:00|   1|   2|
#|2018-08-01 09:00:00|   2|   3|
#|2018-08-01 10:15:43|   2|   3|
#|2018-08-01 11:00:01|   3|   4|
#|2018-08-01 06:00:13|   4|   5|
#|2018-08-01 13:00:00|   4|   6|
#|2018-08-13 14:00:00|   5|   7|
#|2018-08-13 14:15:03|   5|   7|
#|2018-08-13 14:45:08|   5|   7|
#|2018-08-13 14:50:00|   5|   7|
#+-------------------+----+----+
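
To check the rule without a Spark session, the same logic can be sketched in plain Python. This is only an illustration, not part of the PySpark answer: the names `assign_trips`, `SAMPLE_ROWS`, and `LOOKBACK_SECONDS` are made up here, and the sketch assumes timestamps are unique within a ship, so that looking at the previous record of the same ship is equivalent to checking whether the 7199-second window holds more than one row.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
LOOKBACK_SECONDS = 7199  # same bound as rangeBetween(-7199, Window.currentRow)

# Hypothetical sample input: the (timestamp, ship) pairs from the question.
SAMPLE_ROWS = [
    ("2018-08-01 06:01:00", 1), ("2018-08-01 06:01:30", 1),
    ("2018-08-01 09:00:00", 1), ("2018-08-01 09:00:00", 2),
    ("2018-08-01 10:15:43", 2), ("2018-08-01 11:00:01", 3),
    ("2018-08-01 06:00:13", 4), ("2018-08-01 13:00:00", 4),
    ("2018-08-13 14:00:00", 5), ("2018-08-13 14:15:03", 5),
    ("2018-08-13 14:45:08", 5), ("2018-08-13 14:50:00", 5),
]


def assign_trips(rows):
    """Return (timestamp, ship, trip) tuples ordered by ship, then timestamp."""
    parsed = sorted(
        ((datetime.strptime(ts, FMT), ship) for ts, ship in rows),
        key=lambda r: (r[1], r[0]),
    )
    result = []
    trip = 0
    prev_ship, prev_ts = None, None
    for ts, ship in parsed:
        # First record of a ship (row_number() == 1 in the Spark version).
        new_ship = ship != prev_ship
        # No earlier record of this ship within the lookback window
        # (size(collect_list(...)) == 1 in the Spark version).
        long_gap = (not new_ship) and (ts - prev_ts).total_seconds() > LOOKBACK_SECONDS
        if new_ship or long_gap:
            trip += 1  # running sum of the "new trip" flags
        result.append((ts.strftime(FMT), ship, trip))
        prev_ship, prev_ts = ship, ts
    return result


trips = [trip for _, _, trip in assign_trips(SAMPLE_ROWS)]
print(trips)  # [1, 1, 2, 3, 3, 4, 5, 6, 7, 7, 7, 7]
```

The printed trip numbers match the "trip" column in the output above.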

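Thanks, I'll check it tomorrow. What is the -7199 in the range about?

unix_timestamp gives the timestamp in seconds, and 7200 seconds = 2 hours. The window range runs from 7199 seconds before the current row up to the current row (offsets 0 to 7199), which is 7200 seconds in total.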
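
The boundary behaviour of that range can be illustrated with a small plain-Python check. This is a sketch of the arithmetic only; `within_lookback` is a hypothetical helper, not a PySpark function, and the timestamps are made-up examples.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"


def within_lookback(earlier, later, lookback=7199):
    """True if `earlier` lies in the inclusive range [later - lookback, later], in seconds."""
    delta = int((datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)).total_seconds())
    return 0 <= delta <= lookback


# 7199 seconds apart: the earlier record is still inside the window (same trip).
print(within_lookback("2018-08-01 06:00:00", "2018-08-01 07:59:59"))  # True
# 7200 seconds apart: the earlier record falls outside the window (new trip).
print(within_lookback("2018-08-01 06:00:00", "2018-08-01 08:00:00"))  # False
```

With this bound, a record exactly 2 hours after the previous one starts a new trip. If it should still count as the same trip, the lower bound would need to be -7200 instead.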
