Dataframe: replace a missing value with the previous value
I have the following dataframe:
id, test, date
1, A, 01/20/2020
1, B, 01/25/2020
1, C, 01/25/2020
1, A, 02/20/2020
1, B, 02/25/2020
1, C, NA
Because the date for C in the last row is NA, I want to look up any previous date for C and fill it in place of the NA.
The resulting dataframe should be:
id, test, date
1, A, 01/20/2020
1, B, 01/25/2020
1, C, 01/25/2020
1, A, 02/20/2020
1, B, 02/25/2020
1, C, 01/25/2020
Use the window lag function together with when/otherwise: check whether the date column is "NA" and, if so, replace it with the last value seen for that group.
Example (Scala):
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df=Seq(("1","A","01/20/2020"),("1","B","01/25/2020"),("1","C","01/25/2020"),("1","A","02/20/2020"),("1","B","02/25/2020"),("1","C","NA")).toDF("id","test","date")
//parse the string so rows can be ordered chronologically; "NA" becomes null
val df1=df.withColumn("new_dt",to_date(col("date"),"MM/dd/yyyy"))
//change partitionBy/orderBy as per requirement; desc sorts nulls (the NA row) last
val w=Window.partitionBy("id","test").orderBy(desc("new_dt"))
df1.withColumn("date",when(col("date")==="NA",lag(col("date"),1).over(w)).otherwise(col("date"))).drop("new_dt").show()
//+---+----+----------+
//| id|test| date|
//+---+----+----------+
//| 1| A|02/20/2020|
//| 1| A|01/20/2020|
//| 1| B|02/25/2020|
//| 1| B|01/25/2020|
//| 1| C|01/25/2020|
//| 1| C|01/25/2020|
//+---+----+----------+
In PySpark:

from pyspark.sql.functions import *
from pyspark.sql.window import Window

df=spark.createDataFrame([("1","A","01/20/2020"),("1","B","01/25/2020"),("1","C","01/25/2020"),("1","A","02/20/2020"),("1","B","02/25/2020"),("1","C","NA")],["id","test","date"])
#parse the string so rows can be ordered chronologically; "NA" becomes null
df1=df.withColumn("new_dt",to_date(col("date"),"MM/dd/yyyy"))
#change partitionBy/orderBy as per requirement; desc sorts nulls (the NA row) last
w=Window.partitionBy("id","test").orderBy(desc("new_dt"))
df1.withColumn("date",when(col("date")=="NA",lag(col("date"),1).over(w)).otherwise(col("date"))).drop("new_dt").show()
#+---+----+----------+
#| id|test| date|
#+---+----+----------+
#| 1| A|02/20/2020|
#| 1| A|01/20/2020|
#| 1| B|02/25/2020|
#| 1| B|01/25/2020|
#| 1| C|01/25/2020|
#| 1| C|01/25/2020|
#+---+----+----------+
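Note that lag(1) only repairs a single trailing NA per (id, test) group; if two consecutive dates were missing, the lagged value would itself be "NA". For a general forward fill, last with ignorenulls over a running window carries the most recent non-null date forward. A minimal PySpark sketch, reusing df1 with the parsed new_dt column from above (the window name w2 is just illustrative):

from pyspark.sql.functions import col, last, when
from pyspark.sql.window import Window

#order rows chronologically; NA rows (null new_dt) sort to the end of each group
w2=Window.partitionBy("id","test").orderBy(col("new_dt").asc_nulls_last()).rowsBetween(Window.unboundedPreceding, Window.currentRow)
#turn "NA" into null, then carry the last non-null date forward
df1.withColumn("date",last(when(col("date")!="NA",col("date")),ignorenulls=True).over(w2)).drop("new_dt").show()

This reproduces the expected output above and also handles runs of consecutive missing values.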