Apache Spark: Finding the delta of a column in the PySpark interactive shell
I have this DataFrame:
DataFrame[visitors: int, beach: string, Date: date]
with the following data:
+-----------+------------+--------+
|date       |beach       |visitors|
+-----------+------------+--------+
|2020-03-02 |Bondi Beach |205     |
|2020-03-02 |Nissi Beach |218     |
|2020-03-03 |Bar Beach   |201     |
|2020-03-04 |Navagio     |102     |
|2020-03-04 |Champangne  |233     |
|2020-03-05 |Lighthouse  |500     |
|2020-03-06 |Mazo        |318     |
+-----------+------------+--------+
I want to compute a delta for the visitors column, i.e. each row's visitors minus the next row's visitors.
Expected output:
+-----------+------------+--------+------+
|date       |beach       |visitors|Delta |
+-----------+------------+--------+------+
|2020-03-02 |Bondi Beach |205     |-13   | (205-218)
|2020-03-02 |Nissi Beach |218     |17    | (218-201)
|2020-03-03 |Bar Beach   |201     |99    | (201-102)
|2020-03-04 |Navagio     |102     |-131  | (102-233)
|2020-03-04 |Champangne  |233     |-267  | (233-500)
|2020-03-05 |Lighthouse  |500     |182   | (500-318)
|2020-03-06 |Mazo        |318     |318   | (318-0)
+-----------+------------+--------+------+
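
For anyone who wants to reproduce the example, the sample DataFrame can be built like this (a minimal sketch; this setup is not part of the original post):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

rows = [("2020-03-02", "Bondi Beach", 205),
        ("2020-03-02", "Nissi Beach", 218),
        ("2020-03-03", "Bar Beach",   201),
        ("2020-03-04", "Navagio",     102),
        ("2020-03-04", "Champangne",  233),
        ("2020-03-05", "Lighthouse",  500),
        ("2020-03-06", "Mazo",        318)]

# Cast the string dates to a proper date column to match the schema above
df = (spark.createDataFrame(rows, ["date", "beach", "visitors"])
      .withColumn("date", col("date").cast("date")))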
You can solve this with the lead window function. Since lead is null for the last row, I use the coalesce function to replace that null with the visitors value:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, lead, coalesce

# Order the whole DataFrame by date (no partition column)
w = Window.orderBy("date")

# delta = visitors minus the next row's visitors; lead() is null
# on the last row, so coalesce falls back to the visitors value
df.withColumn("delta", col("visitors") - lead("visitors").over(w)) \
  .withColumn("delta", coalesce("delta", "visitors")).show()
+----------+-----------+--------+-----+
| date| beach|visitors|delta|
+----------+-----------+--------+-----+
|2020-03-02|Bondi Beach| 205| -13|
|2020-03-02|Nissi Beach| 218| 17|
|2020-03-03| Bar Beach| 201| 99|
|2020-03-04| Navagio| 102| -131|
|2020-03-04| Champangne| 233| -267|
|2020-03-05| Lighthouse| 500| 182|
|2020-03-06| Mazo| 318| 318|
+----------+-----------+--------+-----+
Note: I am ordering only by the date field. It is better to include another column, like an id, in the orderBy clause so that row order stays deterministic when several rows share the same date. Also, using a window without a partition can hurt performance, since all rows are pulled through a single task; see the sketch below.
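
A minimal sketch of that suggestion, assuming we synthesize the missing id with monotonically_increasing_id() (the original data has no id column):

from pyspark.sql.window import Window
from pyspark.sql.functions import col, lead, coalesce, monotonically_increasing_id

# Assumed tiebreaker: the data has no id column, so synthesize one.
# The ids are unique and increasing, though not necessarily consecutive.
df_id = df.withColumn("id", monotonically_increasing_id())

# Ordering by date plus the tiebreaker makes the row order deterministic
# when several rows share the same date. Without partitionBy, Spark still
# moves all rows through one task; add .partitionBy(...) if your data has
# a natural grouping column.
w = Window.orderBy("date", "id")

df_id.withColumn("delta", col("visitors") - lead("visitors").over(w)) \
     .withColumn("delta", coalesce("delta", "visitors")).show()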