PySpark: fill row missing values with the previous and next non-missing values


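I know you can forward-fill or backward-fill missing values with the previous or next non-missing value by combining the last function with a window function.

But I have data that looks like this: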

Area,Date,Population
A, 1/1/2000, 10000
A, 2/1/2000, 
A, 3/1/2000, 
A, 4/1/2000, 10030
A, 5/1/2000,

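In this example, for the May population I would like to fill in 10030, which is easy. But for February and March, the value I want to fill in is the mean of 10000 and 10030, not 10000 or 10030.

Do you know how to implement this?

Thanks.

Here is my attempt.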

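w1 and w2 are used to partition the window, while w3 and w4 are used to fetch the preceding and following non-missing values. After that, you can write the condition that computes the value used to fill Population.
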
import pyspark.sql.functions as f
from pyspark.sql import Window

# w1/w2: running counts of non-null rows, looking backward (w1) and forward (w2).
w1 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.unboundedPreceding, Window.currentRow)
w2 = Window.partitionBy('Area').orderBy('Date').rowsBetween(Window.currentRow, Window.unboundedFollowing)
# w3/w4: group rows by those counts, so each null shares a group with its
# previous (w3) or next (w4) non-null value.
w3 = Window.partitionBy('Area', 'partition1').orderBy('Date')
w4 = Window.partitionBy('Area', 'partition2').orderBy(f.desc('Date'))

# 'first' holds the previous non-null value and 'last' the next one.
# 'fill' is their mean when both exist, otherwise whichever one is available.
df.withColumn('check', f.col('Population').isNotNull().cast('int')) \
  .withColumn('partition1', f.sum('check').over(w1)) \
  .withColumn('partition2', f.sum('check').over(w2)) \
  .withColumn('first', f.first('Population').over(w3)) \
  .withColumn('last',  f.first('Population').over(w4)) \
  .withColumn('fill', f.when(f.col('first').isNotNull() & f.col('last').isNotNull(), (f.col('first') + f.col('last')) / 2).otherwise(f.coalesce('first', 'last'))) \
  .withColumn('Population', f.coalesce('Population', 'fill')) \
  .orderBy('Date') \
  .select(*df.columns).show(10, False)

+----+--------+----------+
|Area|Date    |Population|
+----+--------+----------+
|A   |1/1/2000|10000.0   |
|A   |2/1/2000|10015.0   |
|A   |3/1/2000|10015.0   |
|A   |4/1/2000|10030.0   |
|A   |5/1/2000|10030.0   |
+----+--------+----------+
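
To see why the two running counts put each gap in the same group as its neighbouring values, the window logic can be traced in plain Python for a single Area. This is only an illustrative sketch and not part of the answer: `trace_fill` is a hypothetical helper, and it assumes the list is already sorted by date.

```python
from typing import List, Optional


def trace_fill(values: List[Optional[float]]) -> List[Optional[float]]:
    """Plain-Python mirror of the w1..w4 window logic for one Area."""
    n = len(values)
    check = [0 if v is None else 1 for v in values]

    # partition1 (w1): count of non-null values up to and including each row.
    partition1, running = [], 0
    for c in check:
        running += c
        partition1.append(running)

    # partition2 (w2): count of non-null values from each row to the end.
    partition2, running = [0] * n, 0
    for i in range(n - 1, -1, -1):
        running += check[i]
        partition2[i] = running

    filled = []
    for i, value in enumerate(values):
        # 'first' (w3): value on the earliest row with the same partition1 id.
        first = values[partition1.index(partition1[i])]
        # 'last' (w4): value on the latest row with the same partition2 id.
        last = values[n - 1 - partition2[::-1].index(partition2[i])]

        if first is not None and last is not None:
            fill = (first + last) / 2
        else:
            fill = first if first is not None else last
        # Keep the original value when present, otherwise use the fill.
        filled.append(value if value is not None else fill)
    return filled
```

For the sample column `[10000, None, None, 10030, None]`, partition1 works out to `[1, 1, 1, 2, 2]` and partition2 to `[2, 1, 1, 1, 0]`, so February and March share a group with both 10000 and 10030, while May only has a previous value.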

Get the next and previous non-null values and compute their average, as follows:
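// Assumes an existing SparkSession named spark and the input DataFrame df2.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, first, last}
import spark.implicits._

df2.show(false)
df2.printSchema()
/**
  * +----+--------+----------+
  * |Area|Date    |Population|
  * +----+--------+----------+
  * |A   |1/1/2000|10000     |
  * |A   |2/1/2000|null      |
  * |A   |3/1/2000|null      |
  * |A   |4/1/2000|10030     |
  * |A   |5/1/2000|null      |
  * +----+--------+----------+
  *
  * root
  * |-- Area: string (nullable = true)
  * |-- Date: string (nullable = true)
  * |-- Population: integer (nullable = true)
  */

val w1 = Window.partitionBy("Area").orderBy("Date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val w2 = Window.partitionBy("Area").orderBy("Date").rowsBetween(Window.currentRow, Window.unboundedFollowing)
// previous: last non-null value up to the current row; next: first non-null value from it onward.
// When one side is missing, coalesce falls back to the other, so the mean equals that value.
df2.withColumn("previous", last("Population", ignoreNulls = true).over(w1))
  .withColumn("next", first("Population", ignoreNulls = true).over(w2))
  .withColumn("new_population", (coalesce($"previous", $"next") + coalesce($"next", $"previous")) / 2)
  .drop("next", "previous")
  .show(false)
/**
  * +----+--------+----------+--------------+
  * |Area|Date    |Population|new_population|
  * +----+--------+----------+--------------+
  * |A   |1/1/2000|10000     |10000.0       |
  * |A   |2/1/2000|null      |10015.0       |
  * |A   |3/1/2000|null      |10015.0       |
  * |A   |4/1/2000|10030     |10030.0       |
  * |A   |5/1/2000|null      |10030.0       |
  * +----+--------+----------+--------------+
  */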
Does this answer your question?