
Dataframe: replace a missing value with the previous value

Tags: dataframe, apache-spark, pyspark, apache-spark-sql

I have the following dataframe:

id, test, date
1, A, 01/20/2020
1, B, 01/25/2020
1, C, 01/25/2020
1, A, 02/20/2020
1, B, 02/25/2020
1, C, NA
Because the date for C in the last row is NA, I want to look up any previous date for C and fill it in place of the NA.

The resulting dataframe should be:

id, test, date
1, A, 01/20/2020
1, B, 01/25/2020
1, C, 01/25/2020
1, A, 02/20/2020
1, B, 02/25/2020
1, C, 01/25/2020

Use a window with the lag function and otherwise: check whether the date column contains "NA" and, if it does, replace it with the previously seen value.

Example:

val df=Seq(("1","A","01/20/2020"),("1","B","01/25/2020"),("1","C","01/25/2020"),("1","A","02/20/2020"),("1","B","02/25/2020"),("1","C","NA")).toDF("id","test","date")

import org.apache.spark.sql.expressions.Window

val df1=df.withColumn("new_dt",to_date(col("date"),"MM/dd/yyyy"))
val w=Window.partitionBy("id","test").orderBy(desc("new_dt"))

df1.withColumn("date",when(col("date")==="NA",lag(col("date"),1).over(w)).otherwise(col("date"))).drop("new_dt").show()
//+---+----+----------+
//| id|test|      date|
//+---+----+----------+
//|  1|   A|02/20/2020|
//|  1|   A|01/20/2020|
//|  1|   B|02/25/2020|
//|  1|   B|01/25/2020|
//|  1|   C|01/25/2020|
//|  1|   C|01/25/2020|
//+---+----+----------+
df=spark.createDataFrame([("1","A","01/20/2020"),("1","B","01/25/2020"),("1","C","01/25/2020"),("1","A","02/20/2020"),("1","B","02/25/2020"),("1","C","NA")],["id","test","date"])

from pyspark.sql.functions import *

df1=df.withColumn("new_dt",to_date(col("date"),"MM/dd/yyyy"))

#change partitionby,orderby as per requirement
w=Window.partitionBy("id","test").orderBy(desc("new_dt"))

df1.withColumn("date",when(col("date")=="NA",lag(col("date"),1).over(w)).otherwise(col("date"))).drop("new_dt").show()
#+---+----+----------+
#| id|test|      date|
#+---+----+----------+
#|  1|   A|02/20/2020|
#|  1|   A|01/20/2020|
#|  1|   B|02/25/2020|
#|  1|   B|01/25/2020|
#|  1|   C|01/25/2020|
#|  1|   C|01/25/2020|
#+---+----+----------+
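
Note that lag only looks one row back, so this fills a single trailing NA per (id, test) group. If a group could contain several consecutive missing dates, a forward fill with last(..., ignorenulls=True) over a running window handles that case too. A minimal PySpark sketch under that assumption; df is the same dataframe as above, and the names w_fill and filled_dt are illustrative, not from the original answer:

from pyspark.sql.functions import col, to_date, date_format, last
from pyspark.sql.window import Window

# Parse the date so that "NA" becomes null.
df1 = df.withColumn("new_dt", to_date(col("date"), "MM/dd/yyyy"))

# Running window per (id, test): every row from the start of the group up to the current row,
# ordered so that rows with a missing (null) date come after the rows that have one.
w_fill = (Window.partitionBy("id", "test")
                .orderBy(col("new_dt").asc_nulls_last())
                .rowsBetween(Window.unboundedPreceding, Window.currentRow))

# last(..., ignorenulls=True) returns the most recent non-null date seen so far,
# which forward-fills any number of consecutive missing values in a group.
filled = (df1
          .withColumn("filled_dt", last(col("new_dt"), ignorenulls=True).over(w_fill))
          .withColumn("date", date_format(col("filled_dt"), "MM/dd/yyyy"))
          .drop("new_dt", "filled_dt"))

filled.show()

If the first row of a group already has a missing date, there is nothing earlier to copy, so that row stays null.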
