Lag function over grouped data in Python
I have a dataframe as follows:
from pyspark.sql import functions as f
from pyspark.sql.window import Window
df = spark.createDataFrame([
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:30:57.000", "Username": "user1", "Region": "US"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:31:57.014", "Username": "user2", "Region": "US"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:32:57.914", "Username": "user1", "Region": "MX"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:35:57.914", "Username": "user2", "Region": "CA"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:33:57.914", "Username": "user1", "Region": "UK"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:34:57.914", "Username": "user1", "Region": "GR"},
{"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:36:57.914", "Username": "user2", "Region": "IR"}])
w = Window.partitionBy().orderBy("groupId","Username").orderBy("Username","ts")
df2 = df.withColumn("prev_region", f.lag(df.Region).over(w))
+----------+------+--------+-------+-----------------------+
|       Day|Region|Username|groupId|                     ts|
+----------+------+--------+-------+-----------------------+
|2021-01-27|    US|   user1|      A|2021-01-27 08:30:57.000|
|2021-01-27|    MX|   user1|      A|2021-01-27 08:32:57.914|
|2021-01-27|    UK|   user1|      A|2021-01-27 08:33:57.914|
|2021-01-27|    GR|   user1|      A|2021-01-27 08:34:57.914|
|2021-01-27|    US|   user2|      A|2021-01-27 08:31:57.014|
|2021-01-27|    CA|   user2|      A|2021-01-27 08:35:57.914|
|2021-01-27|    IR|   user2|      A|2021-01-27 08:36:57.914|
+----------+------+--------+-------+-----------------------+
You just need to add the Username column to the partitionBy call. Also, there is no need for two chained orderBy calls, since the second one overrides the first. Change your line to:
w = Window.partitionBy('Username').orderBy("ts")
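For intuition, `lag` over this window is equivalent to sorting each user's rows by timestamp and shifting the Region column down by one, with the first row of each partition getting null. A minimal plain-Python sketch of that semantics (no Spark needed; row values taken from the sample data above):

```python
from collections import defaultdict

# (Username, ts, Region) triples from the sample dataframe
rows = [
    ("user1", "2021-01-27 08:30:57.000", "US"),
    ("user2", "2021-01-27 08:31:57.014", "US"),
    ("user1", "2021-01-27 08:32:57.914", "MX"),
    ("user2", "2021-01-27 08:35:57.914", "CA"),
    ("user1", "2021-01-27 08:33:57.914", "UK"),
    ("user1", "2021-01-27 08:34:57.914", "GR"),
    ("user2", "2021-01-27 08:36:57.914", "IR"),
]

def lag_region(rows):
    """Emulate lag(Region) over (partition by Username order by ts)."""
    by_user = defaultdict(list)
    for user, ts, region in rows:
        by_user[user].append((ts, region))
    result = {}
    for user, entries in by_user.items():
        entries.sort()           # order by ts within the partition
        prev = None              # first row in each partition gets null
        for ts, region in entries:
            result[(user, ts)] = prev
            prev = region
    return result

prev = lag_region(rows)
print(prev[("user1", "2021-01-27 08:32:57.914")])  # US
print(prev[("user2", "2021-01-27 08:31:57.014")])  # None
```

This matches the prev_region column in the expected output: within each user's partition, each row sees the Region of the previous row in timestamp order.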
You are almost there. Given your dataframe, just specify the window function as follows:
Python API
>>> w = Window.partitionBy("Username").orderBy("groupId", "Username", "ts")
>>> df2 = df.withColumn("prev_region", f.lag(df.Region).over(w))
>>> df2.show(truncate=100)
+----------+------+--------+-------+-----------------------+-----------+
|       Day|Region|Username|groupId|                     ts|prev_region|
+----------+------+--------+-------+-----------------------+-----------+
|2021-01-27|    US|   user1|      A|2021-01-27 08:30:57.000|       null|
|2021-01-27|    MX|   user1|      A|2021-01-27 08:32:57.914|         US|
|2021-01-27|    UK|   user1|      A|2021-01-27 08:33:57.914|         MX|
|2021-01-27|    GR|   user1|      A|2021-01-27 08:34:57.914|         UK|
|2021-01-27|    US|   user2|      A|2021-01-27 08:31:57.014|       null|
|2021-01-27|    CA|   user2|      A|2021-01-27 08:35:57.914|         US|
|2021-01-27|    IR|   user2|      A|2021-01-27 08:36:57.914|         CA|
+----------+------+--------+-------+-----------------------+-----------+
SQL API
df.createOrReplaceTempView("df")
result = spark.sql("""
    SELECT Day, Region, Username, groupId, ts,
           LAG(Region) OVER (PARTITION BY Username ORDER BY groupId, Username, ts) AS rank
    FROM df
""")
result.show(truncate=100)
+----------+------+--------+-------+-----------------------+----+
|       Day|Region|Username|groupId|                     ts|rank|
+----------+------+--------+-------+-----------------------+----+
|2021-01-27|    US|   user1|      A|2021-01-27 08:30:57.000|null|
|2021-01-27|    MX|   user1|      A|2021-01-27 08:32:57.914|  US|
|2021-01-27|    UK|   user1|      A|2021-01-27 08:33:57.914|  MX|
|2021-01-27|    GR|   user1|      A|2021-01-27 08:34:57.914|  UK|
|2021-01-27|    US|   user2|      A|2021-01-27 08:31:57.014|null|
|2021-01-27|    CA|   user2|      A|2021-01-27 08:35:57.914|  US|
|2021-01-27|    IR|   user2|      A|2021-01-27 08:36:57.914|  CA|
+----------+------+--------+-------+-----------------------+----+
If there are multiple groups with multiple groupIds, specify the window function as follows:
>>> w = Window.partitionBy("groupId", "Username").orderBy("groupId", "ts", "Username")
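The plain-Python equivalent of this composite partition key is to group on the pair (groupId, Username) instead of Username alone. A sketch with a hypothetical second groupId "B" added purely for illustration (those rows are not in the original data):

```python
from collections import defaultdict

# (groupId, Username, ts, Region) — the "B" rows are made up for illustration
rows = [
    ("A", "user1", "2021-01-27 08:30:57.000", "US"),
    ("A", "user1", "2021-01-27 08:32:57.914", "MX"),
    ("B", "user1", "2021-01-27 09:00:00.000", "FR"),
    ("B", "user1", "2021-01-27 09:05:00.000", "DE"),
]

by_key = defaultdict(list)
for gid, user, ts, region in rows:
    by_key[(gid, user)].append((ts, region))   # partition by (groupId, Username)

prev_region = {}
for key, entries in by_key.items():
    entries.sort()                             # order by ts within each partition
    prev = None                                # lag restarts in every partition
    for ts, region in entries:
        prev_region[key + (ts,)] = prev
        prev = region

print(prev_region[("B", "user1", "2021-01-27 09:05:00.000")])  # FR
print(prev_region[("B", "user1", "2021-01-27 09:00:00.000")])  # None
```

Note that the lag resets at every (groupId, Username) boundary: user1's rows under groupId "B" do not see regions from user1's rows under groupId "A".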