Apache Spark: computing an average filtered on a column condition and over a window period in PySpark
I have a PySpark DataFrame:
date    | cust | amount | is_delinquent
---------------------------------------
1/1/20  | A    | 5      | 0
13/1/20 | A    | 1      | 0
15/1/20 | A    | 3      | 1
19/1/20 | A    | 4      | 0
20/1/20 | A    | 4      | 1
27/1/20 | A    | 2      | 0
1/2/20  | A    | 2      | 0
5/2/20  | A    | 1      | 0
1/1/20  | B    | 7      | 0
9/1/20  | B    | 5      | 0
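For reference, here is a minimal sketch that reproduces this DataFrame; the SparkSession variable spark is an assumption, and data is the name the code below uses:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # assumed local/default session

# dates are strings in d/M/yy format, matching the table above
data = spark.createDataFrame(
    [
        ("1/1/20", "A", 5, 0), ("13/1/20", "A", 1, 0), ("15/1/20", "A", 3, 1),
        ("19/1/20", "A", 4, 0), ("20/1/20", "A", 4, 1), ("27/1/20", "A", 2, 0),
        ("1/2/20", "A", 2, 0), ("5/2/20", "A", 1, 0),
        ("1/1/20", "B", 7, 0), ("9/1/20", "B", 5, 0),
    ],
    ["date", "cust", "amount", "is_delinquent"],
)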
Now I want to compute the average of amount over a 30-day window period, filtered on the is_delinquent column being equal to 0. Rows where is_delinquent equals 1 should be skipped and get NaN (null) instead.
My expected final DataFrame is:
date    | cust | amount | is_delinquent | avg_amount
----------------------------------------------------
1/1/20  | A    | 5      | 0             | null
13/1/20 | A    | 1      | 0             | 5
15/1/20 | A    | 3      | 1             | null
19/1/20 | A    | 4      | 0             | 3
20/1/20 | A    | 4      | 1             | null
27/1/20 | A    | 2      | 0             | 3.333
1/2/20  | A    | 2      | 0             | null
5/2/20  | A    | 1      | 0             | 2
1/1/20  | B    | 7      | 0             | null
9/1/20  | B    | 5      | 0             | 7
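For example, on 27/1/20 for customer A, the non-delinquent rows in the preceding 30 days are 1/1/20 (5), 13/1/20 (1) and 19/1/20 (4), so the average is (5 + 1 + 4) / 3 ≈ 3.333; the delinquent rows on 15/1/20 and 20/1/20 contribute nothing.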
Without the filtering, my code looks like this:
import pyspark.sql.functions as F
from pyspark.sql.window import Window

days = lambda i: i * 86400  # one day expressed in seconds

w_pay_30x = (
    Window.partitionBy("cust")
    .orderBy(F.col("date").cast("timestamp").cast("long"))
    .rangeBetween(-days(30), -days(1))
)
data = data.withColumn("avg_amount", F.avg("amount").over(w_pay_30x))
Any idea how to add this filter?

You can compute and display the average only when is_delinquent equals 0, using a conditional expression both outside and inside the aggregate. Additionally, you may want to include the month in the window's partitionBy clause.
from pyspark.sql import functions as F, Window

days = lambda i: i * 86400  # one day expressed in seconds

# partition by customer and calendar month; order by the parsed date as a Unix timestamp
w_pay_30x = (
    Window.partitionBy("cust", F.month(F.to_timestamp('date', 'd/M/yy')))
    .orderBy(F.to_timestamp('date', 'd/M/yy').cast('long'))
    .rangeBetween(-days(30), -days(1))
)

data2 = data.withColumn(
    'avg_amount',
    # only show an average on non-delinquent rows...
    F.when(
        F.col('is_delinquent') == 0,
        # ...and only average the non-delinquent amounts within the window
        F.avg(
            F.when(F.col('is_delinquent') == 0, F.col('amount'))
        ).over(w_pay_30x)
    )
).orderBy('cust', F.to_timestamp('date', 'd/M/yy'))

data2.show()
+-------+----+------+-------------+------------------+
| date|cust|amount|is_delinquent| avg_amount|
+-------+----+------+-------------+------------------+
| 1/1/20| A| 5| 0| null|
|13/1/20| A| 1| 0| 5.0|
|15/1/20| A| 3| 1| null|
|19/1/20| A| 4| 0| 3.0|
|20/1/20| A| 4| 1| null|
|27/1/20| A| 2| 0|3.3333333333333335|
| 1/2/20| A| 2| 0| null|
| 5/2/20| A| 1| 0| 2.0|
| 1/1/20| B| 7| 0| null|
| 9/1/20| B| 5| 0| 7.0|
+-------+----+------+-------------+------------------+
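Note that because the window is partitioned by month as well as by customer, the first non-delinquent row of each new month (for example 1/2/20 for customer A) has no preceding rows in its partition, which is why its avg_amount is null even though is_delinquent is 0.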