
Python lag function over grouped data

Tags: python, dataframe, apache-spark, pyspark

I have a dataframe that looks like this:

from pyspark.sql import functions as f
from pyspark.sql.window import Window

df = spark.createDataFrame([
  {"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:30:57.000", "Username": "user1", "Region": "US"},
  {"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:31:57.014", "Username": "user2", "Region": "US"},
  {"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:32:57.914", "Username": "user1", "Region": "MX"},
  {"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:35:57.914", "Username": "user2", "Region": "CA"},
  {"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:33:57.914", "Username": "user1", "Region": "UK"},
  {"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:34:57.914", "Username": "user1", "Region": "GR"},
  {"groupId":"A","Day":"2021-01-27", "ts": "2021-01-27 08:36:57.914", "Username": "user2", "Region": "IR"}])

w = Window.partitionBy().orderBy("groupId","Username").orderBy("Username","ts")
df2 = df.withColumn("prev_region", f.lag(df.Region).over(w))
Day         Region  Username  groupId  ts
2021-01-27  US      user1     A        2021-01-27 08:30:57.000
2021-01-27  MX      user1     A        2021-01-27 08:32:57.914
2021-01-27  UK      user1     A        2021-01-27 08:33:57.914
2021-01-27  GR      user1     A        2021-01-27 08:34:57.914
2021-01-27  US      user2     A        2021-01-27 08:31:57.014
2021-01-27  CA      user2     A        2021-01-27 08:35:57.914
2021-01-27  IR      user2     A        2021-01-27 08:36:57.914

You only need to add the Username column to the partitionBy function. There is also no need for two orderBy calls. Change your line to:

w = Window.partitionBy('Username').orderBy("ts")
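
For completeness, a minimal sketch applying this corrected window end to end (it reuses df, f and Window from the question; the final orderBy/show is only there to make the output easy to read):

# lag() is now computed per user, ordered by the event timestamp,
# so it never looks across user boundaries
w = Window.partitionBy("Username").orderBy("ts")
df2 = df.withColumn("prev_region", f.lag(df.Region).over(w))

# each user's earliest row gets null; every later row carries that user's previous Region
df2.orderBy("Username", "ts").show(truncate=False)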
You're almost there.

Just specify the window function as follows, based on your dataframe:

Python API

>>> w = Window.partitionBy("Username").orderBy("groupId", "Username", "ts")
>>> df2 = df.withColumn("prev_region", f.lag(df.Region).over(w))
>>> df2.show(truncate=100)
+----------+------+--------+-------+-----------------------+-----------+
|Day       |Region|Username|groupId|ts                     |prev_region|
+----------+------+--------+-------+-----------------------+-----------+
|2021-01-27|US    |user1   |A      |2021-01-27 08:30:57.000|null       |
|2021-01-27|MX    |user1   |A      |2021-01-27 08:32:57.914|US         |
|2021-01-27|UK    |user1   |A      |2021-01-27 08:33:57.914|MX         |
|2021-01-27|GR    |user1   |A      |2021-01-27 08:34:57.914|UK         |
|2021-01-27|US    |user2   |A      |2021-01-27 08:31:57.014|null       |
|2021-01-27|CA    |user2   |A      |2021-01-27 08:35:57.914|US         |
|2021-01-27|IR    |user2   |A      |2021-01-27 08:36:57.914|CA         |
+----------+------+--------+-------+-----------------------+-----------+

SQL API

df.createOrReplaceTempView("df")
result = spark.sql("""
    SELECT
        Day, Region, Username, groupId, ts,
        lag(Region) OVER (PARTITION BY Username ORDER BY groupId, Username, ts) AS prev_region
    FROM df
""")
result.show(truncate=100)
+----------+------+--------+-------+-----------------------+-----------+
|Day       |Region|Username|groupId|ts                     |prev_region|
+----------+------+--------+-------+-----------------------+-----------+
|2021-01-27|US    |user1   |A      |2021-01-27 08:30:57.000|null       |
|2021-01-27|MX    |user1   |A      |2021-01-27 08:32:57.914|US         |
|2021-01-27|UK    |user1   |A      |2021-01-27 08:33:57.914|MX         |
|2021-01-27|GR    |user1   |A      |2021-01-27 08:34:57.914|UK         |
|2021-01-27|US    |user2   |A      |2021-01-27 08:31:57.014|null       |
|2021-01-27|CA    |user2   |A      |2021-01-27 08:35:57.914|US         |
|2021-01-27|IR    |user2   |A      |2021-01-27 08:36:57.914|CA         |
+----------+------+--------+-------+-----------------------+-----------+

If you have multiple groups with multiple groupIds, specify the window function like this:
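
As a side note (not part of the original answer), the same SQL window expression can also be used from the DataFrame API through f.expr, which avoids registering a temp view; a small sketch:

# sketch of an equivalent of the SQL query above, expressed with f.expr
df2 = df.withColumn(
    "prev_region",
    f.expr("lag(Region) over (partition by Username order by groupId, Username, ts)")
)
df2.show(truncate=100)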

>>> w = Window.partitionBy("groupId", "Username").orderBy("groupId", "ts", "Username")
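
For illustration, a short sketch of applying that multi-group window (the sample data above contains only a single groupId, so here the result would match the earlier output):

# lag() now restarts for every (groupId, Username) pair
df2 = df.withColumn("prev_region", f.lag("Region").over(w))
df2.show(truncate=100)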