PySpark: computing a running (cumulative) odometer value over a window


I have a dataframe with a vehicle ID, a timestamp, and an odometer reading. Some odometer readings may be null. I want to create a new column that holds the current odometer for each vehicleID, ordered by timestamp; where the reading is null, it should carry forward the most recent non-null odometer.

Example:

+------------+------------------------+-----------+-------------------------+
|vehicleID   |startDateTimeUtc        |Odometer   |NewColumn-CurrentOdometer|
+------------+------------------------+-----------+-------------------------+
|a           |2019-04-11T16:27:32+0000|10000      |10000                    |
|a           |2019-04-11T16:27:32+0000|15000      |15000                    |
|a           |2019-04-11T16:43:10+0000|null       |15000                    |
|a           |2019-04-11T20:13:52+0000|null       |15000                    |
|a           |2019-04-12T14:50:35+0000|null       |15000                    |
|a           |2019-04-12T18:53:19+0000|20000      |20000                    |
|b           |2019-04-12T19:06:41+0000|350000     |350000                   |
|b           |2019-04-12T19:17:15+0000|370000     |370000                   |
|b           |2019-04-12T19:30:32+0000|null       |370000                   |
|b           |2019-04-12T20:19:41+0000|380000     |380000                   |
|b           |2019-04-12T20:42:26+0000|null       |380000                   |
+------------+------------------------+-----------+-------------------------+
I know I need to use a window function, and probably `lag`, but how can I look back more than one row to find the previous non-null record (see vehicleID a in the example)? Thank you very much.

my_window = Window.partitionBy("vehicleID").orderBy("startDateTimeUtc")

Use the `last` window function with the `ignoreNulls` flag set to `True`, over a frame between `unboundedPreceding` and `currentRow`:

df.show(20,False)
#+---------+------------------------+--------+
#|vehicleid|startdatetimeutc        |odometer|
#+---------+------------------------+--------+
#|a        |2019-04-11T16:27:32+0000|10000   |
#|a        |2019-04-11T16:27:32+0000|15000   |
#|a        |2019-04-11T16:43:10+0000|null    |
#|a        |2019-04-11T20:13:52+0000|null    |
#|a        |2019-04-12T14:50:35+0000|null    |
#|a        |2019-04-12T18:53:19+0000|20000   |
#|b        |2019-04-12T19:06:41+0000|350000  |
#|b        |2019-04-12T19:17:15+0000|370000  |
#|b        |2019-04-12T19:30:32+0000|null    |
#|b        |2019-04-12T20:19:41+0000|380000  |
#|b        |2019-04-12T20:42:26+0000|null    |
#+---------+------------------------+--------+

from pyspark.sql.functions import col, last
from pyspark.sql.window import Window

# Frame from the start of the partition up to and including the current row
my_window = Window.partitionBy("vehicleID").orderBy("startDateTimeUtc") \
                  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

# last(..., ignorenulls=True) returns the most recent non-null Odometer in the frame
df.withColumn("NewColumn-CurrentOdometer",
              last(col("Odometer"), ignorenulls=True).over(my_window)) \
  .orderBy("vehicleID").show(20, False)
#+---------+------------------------+--------+-------------------------+
#|vehicleid|startdatetimeutc        |odometer|NewColumn-CurrentOdometer|
#+---------+------------------------+--------+-------------------------+
#|a        |2019-04-11T16:27:32+0000|10000   |10000                    |
#|a        |2019-04-11T16:27:32+0000|15000   |15000                    |
#|a        |2019-04-11T16:43:10+0000|null    |15000                    |
#|a        |2019-04-11T20:13:52+0000|null    |15000                    |
#|a        |2019-04-12T14:50:35+0000|null    |15000                    |
#|a        |2019-04-12T18:53:19+0000|20000   |20000                    |
#|b        |2019-04-12T19:06:41+0000|350000  |350000                   |
#|b        |2019-04-12T19:17:15+0000|370000  |370000                   |
#|b        |2019-04-12T19:30:32+0000|null    |370000                   |
#|b        |2019-04-12T20:19:41+0000|380000  |380000                   |
#|b        |2019-04-12T20:42:26+0000|null    |380000                   |
#+---------+------------------------+--------+-------------------------+
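To make the semantics of `last(..., ignorenulls=True)` over an unbounded-preceding frame explicit, here is a plain-Python sketch of the same forward fill (an illustration only, not Spark code; the helper name `forward_fill` is my own):

```python
from itertools import groupby
from operator import itemgetter

def forward_fill(rows):
    """Carry the last non-null odometer forward within each vehicle.

    Rows are assumed pre-sorted by vehicle and timestamp, mirroring
    partitionBy("vehicleID").orderBy("startDateTimeUtc").
    """
    out = []
    for vehicle, group in groupby(rows, key=itemgetter(0)):
        last_seen = None  # resets per vehicle, like a new window partition
        for _, odometer in group:
            if odometer is not None:
                last_seen = odometer
            out.append((vehicle, last_seen))
    return out

rows = [("a", 10000), ("a", 15000), ("a", None), ("a", None), ("a", 20000),
        ("b", 350000), ("b", None)]
print(forward_fill(rows))
# [('a', 10000), ('a', 15000), ('a', 15000), ('a', 15000), ('a', 20000),
#  ('b', 350000), ('b', 350000)]
```

The `last_seen` state corresponds to what the window frame "remembers" from the start of the partition up to the current row.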
Another option: use `max` over a window frame between `unboundedPreceding` and `currentRow`.

Load the provided test data:
import spark.implicits._

val data =
  """
    |vehicleID|startDateTimeUtc|Odometer
    |a|2019-04-11T16:27:32+0000|10000
    |a|2019-04-11T16:27:32+0000|15000
    |a|2019-04-11T16:43:10+0000|null
    |a|2019-04-11T20:13:52+0000|null
    |a|2019-04-12T14:50:35+0000|null
    |a|2019-04-12T18:53:19+0000|20000
    |b|2019-04-12T19:06:41+0000|350000
    |b|2019-04-12T19:17:15+0000|370000
    |b|2019-04-12T19:30:32+0000|null
    |b|2019-04-12T20:19:41+0000|380000
    |b|2019-04-12T20:42:26+0000|null
  """.stripMargin
val stringDS1 = data.split(System.lineSeparator())
  .map(_.split("\\|").map(_.replaceAll("""^[ \t]+|[ \t]+$""", "")).mkString(","))
  .toSeq.toDS()
val df1 = spark.read
  .option("sep", ",")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("nullValue", "null")
  .csv(stringDS1)
df1.show(false)
df1.printSchema()
/**
 * +---------+------------------------+--------+
 * |vehicleID|startDateTimeUtc        |Odometer|
 * +---------+------------------------+--------+
 * |a        |2019-04-11T16:27:32+0000|10000   |
 * |a        |2019-04-11T16:27:32+0000|15000   |
 * |a        |2019-04-11T16:43:10+0000|null    |
 * |a        |2019-04-11T20:13:52+0000|null    |
 * |a        |2019-04-12T14:50:35+0000|null    |
 * |a        |2019-04-12T18:53:19+0000|20000   |
 * |b        |2019-04-12T19:06:41+0000|350000  |
 * |b        |2019-04-12T19:17:15+0000|370000  |
 * |b        |2019-04-12T19:30:32+0000|null    |
 * |b        |2019-04-12T20:19:41+0000|380000  |
 * |b        |2019-04-12T20:42:26+0000|null    |
 * +---------+------------------------+--------+
 *
 * root
 * |-- vehicleID: string (nullable = true)
 * |-- startDateTimeUtc: string (nullable = true)
 * |-- Odometer: integer (nullable = true)
 */
Compute the max in an ordered window:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.max

val w = Window.partitionBy("vehicleID").orderBy("startDateTimeUtc")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)
df1.withColumn("NewColumn-CurrentOdometer",
    max("Odometer").over(w))
  .show(false)
/**
 * +---------+------------------------+--------+-------------------------+
 * |vehicleID|startDateTimeUtc        |Odometer|NewColumn-CurrentOdometer|
 * +---------+------------------------+--------+-------------------------+
 * |a        |2019-04-11T16:27:32+0000|10000   |10000                    |
 * |a        |2019-04-11T16:27:32+0000|15000   |15000                    |
 * |a        |2019-04-11T16:43:10+0000|null    |15000                    |
 * |a        |2019-04-11T20:13:52+0000|null    |15000                    |
 * |a        |2019-04-12T14:50:35+0000|null    |15000                    |
 * |a        |2019-04-12T18:53:19+0000|20000   |20000                    |
 * |b        |2019-04-12T19:06:41+0000|350000  |350000                   |
 * |b        |2019-04-12T19:17:15+0000|370000  |370000                   |
 * |b        |2019-04-12T19:30:32+0000|null    |370000                   |
 * |b        |2019-04-12T20:19:41+0000|380000  |380000                   |
 * |b        |2019-04-12T20:42:26+0000|null    |380000                   |
 * +---------+------------------------+--------+-------------------------+
 */
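One caveat worth noting (my observation, not part of the original answer): a running `max` only matches the forward fill because odometer readings never decrease. If a value could ever drop (e.g. after a hypothetical sensor reset), `last` with `ignoreNulls` is the safer choice. A plain-Python comparison of the two frame semantics:

```python
def running_max(values):
    """Cumulative max over non-null values, like max("Odometer") over an
    unboundedPreceding..currentRow frame; null rows keep the previous max."""
    best, out = None, []
    for v in values:
        if v is not None and (best is None or v > best):
            best = v
        out.append(best)
    return out

def last_non_null(values):
    """Forward fill, like last("Odometer", ignorenulls=True) over the same frame."""
    seen, out = None, []
    for v in values:
        if v is not None:
            seen = v
        out.append(seen)
    return out

monotonic = [10000, 15000, None, 20000]
print(running_max(monotonic) == last_non_null(monotonic))  # True: readings only grow

reset = [10000, 15000, None, 500]  # hypothetical odometer reset
print(running_max(reset))    # [10000, 15000, 15000, 15000] -- stays at the old max
print(last_non_null(reset))  # [10000, 15000, 15000, 500]   -- follows the reading
```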

Thanks!! That's exactly what I needed. :)