Apache Spark: finding the delta of a column in the PySpark interactive shell


I have this dataframe:

DataFrame[visitors: int, beach: string, Date: date]

with the following data:

+-----------+------------+--------+
|date       |beach       |visitors|
+-----------+------------+--------+
|2020-03-02 |Bondi Beach |205     |
|2020-03-02 |Nissi Beach |218     |
|2020-03-03 |Bar Beach   |201     |
|2020-03-04 |Navagio     |102     |
|2020-03-04 |Champangne  |233     |
|2020-03-05 |Lighthouse  |500     |
|2020-03-06 |Mazo        |318     |
+-----------+------------+--------+
I want to compute the delta between each row of the visitors column and the next. Expected output:

+-----------+------------+--------+------+
|date       |beach       |visitors|Delta |
+-----------+------------+--------+------+
|2020-03-02 |Bondi Beach |205     |-13   | (205-218)
|2020-03-02 |Nissi Beach |218     |17    | (218-201)
|2020-03-03 |Bar Beach   |201     |99    | (201-102)
|2020-03-04 |Navagio     |102     |-131  | (102-233)
|2020-03-04 |Champangne  |233     |-267  | (233-500)
|2020-03-05 |Lighthouse  |500     |182   | (500-318)
|2020-03-06 |Mazo        |318     |318   | (318-0)
+-----------+------------+--------+------+

You can solve this with the lead window function. Because lead returns null for the last row (there is no following row), I use coalesce to replace that null with the row's own visitors value.

from pyspark.sql.window import Window
from pyspark.sql.functions import col, lead, coalesce

# Order the whole dataframe by date (no partitioning).
w = Window.orderBy("date")

# delta = current visitors minus the next row's visitors;
# for the last row lead() is null, so coalesce falls back to visitors.
df.withColumn("delta", col("visitors") - lead("visitors").over(w)) \
  .withColumn("delta", coalesce("delta", "visitors")) \
  .show()

+----------+-----------+--------+-----+
|      date|      beach|visitors|delta|
+----------+-----------+--------+-----+
|2020-03-02|Bondi Beach|     205|  -13|
|2020-03-02|Nissi Beach|     218|   17|
|2020-03-03|  Bar Beach|     201|   99|
|2020-03-04|    Navagio|     102| -131|
|2020-03-04| Champangne|     233| -267|
|2020-03-05| Lighthouse|     500|  182|
|2020-03-06|       Mazo|     318|  318|
+----------+-----------+--------+-----+
Note: I am ordering only by the date field. It would be better to include another column, such as an id, in the ORDER BY clause so that the row order is deterministic (here two rows share the same date). Also, using a window without a PARTITION BY clause can hurt performance, because Spark must move all rows into a single partition to evaluate it.