Replacing certain fields in a dataframe based on a condition

Tags: dataframe, apache-spark, pyspark, apache-spark-sql

I have a dataframe like the one below. The sample covers just one patient and one specific test; there can be multiple other similar-looking tests.

ptid,blast_date,test_name,result_date,test_result,date_diff
PT381201021,2019-08-22,Albumin,2019-08-14,4.3,8
PT381201021,2019-05-17,Albumin,NA,NA,0
PT381201021,2019-05-18,Albumin,NA,NA,0
PT381201021,2019-05-21,Albumin,NA,NA,0
PT381201021,2019-05-23,Albumin,NA,NA,0
PT381201021,2019-05-16,Albumin,NA,NA,0
PT381201021,2019-05-19,Albumin,NA,NA,0
PT381201021,2019-05-22,Albumin,NA,NA,0
PT381201021,2019-05-20,Albumin,NA,NA,0

I want the result_date and test_result for "Albumin" in this example to be filled in from the latest blast_date, provided the gap is below a certain threshold, say 3 months here. So I want the following row to be filled in like this:

PT381201021,2019-05-23,Albumin,2019-08-14,4.3,0
The date_diff can stay unchanged.

So the final dataframe is expected to look like this:

ptid,blast_date,test_name,result_date,test_result,date_diff
PT381201021,2019-08-22,Albumin,2019-08-14,4.3,8
PT381201021,2019-05-17,Albumin,NA,NA,0
PT381201021,2019-05-18,Albumin,NA,NA,0
PT381201021,2019-05-21,Albumin,NA,NA,0
PT381201021,2019-05-23,Albumin,2019-08-14,4.3,0
PT381201021,2019-05-16,Albumin,NA,NA,0
PT381201021,2019-05-19,Albumin,NA,NA,0
PT381201021,2019-05-22,Albumin,NA,NA,0
PT381201021,2019-05-20,Albumin,NA,NA,0

I tried using the lag function but ran into some difficulty with it. I'm looking for a PySpark approach to solve this.

Hope this approach helps. It is not very optimized, and the execution flow could be improved further.

from pyspark.sql import functions as F

df = spark.read.csv("/Users/61471871.csv", header=True, inferSchema=True)
df2 = df.withColumn("start_date", F.to_date(df.blast_date)) \
        .withColumn("end_date", F.add_months(F.to_date(df.blast_date), 3)) \
        .sort(F.col("start_date").desc())
df_right = df2.sort(df.blast_date.desc())
df_right.createOrReplaceTempView("tbl")
spark.sql("select * from tbl").show()
'''
+-----------+-------------------+---------+-----------+-----------+---------+
|       ptid|         blast_date|test_name|result_date|test_result|date_diff|
+-----------+-------------------+---------+-----------+-----------+---------+
|PT381201021|2019-08-22 00:00:00|  Albumin| 2019-08-14|        4.3|        8|
|PT381201021|2019-05-23 00:00:00|  Albumin|         NA|         NA|        0|
|PT381201021|2019-05-22 00:00:00|  Albumin|         NA|         NA|        0|
|PT381201021|2019-05-21 00:00:00|  Albumin|         NA|         NA|        0|
|PT381201021|2019-05-20 00:00:00|  Albumin|         NA|         NA|        0|
|PT381201021|2019-05-19 00:00:00|  Albumin|         NA|         NA|        0|
|PT381201021|2019-05-18 00:00:00|  Albumin|         NA|         NA|        0|
|PT381201021|2019-05-17 00:00:00|  Albumin|         NA|         NA|        0|
|PT381201021|2019-05-16 00:00:00|  Albumin|         NA|         NA|        0|
+-----------+-------------------+---------+-----------+-----------+---------+
'''
# Exploratory steps (not used below):
# df.sort(df.blast_date.desc()).withColumn("90_days_back", F.add_months(F.to_date(df.blast_date), 3)).show()
# df.select(F.add_months(df.blast_date, 3).alias('third_month')).show()


df_left = spark.sql("select ptid, max(start_date) as range_dt from tbl group by ptid ")
df_one = df_right.crossJoin(df_left)

df_right.join(df_left, df_left.ptid == df_right.ptid).show()
df_two = df_one.withColumn("date_diff", F.datediff(df_one.start_date,     df_one.range_dt))
'''
+-----------+-------------------+---------+-----------+-----------+---------+----------+----------+-----------+----------+
|       ptid|         blast_date|test_name|result_date|test_result|date_diff|start_date|  end_date|       ptid|  range_dt|
+-----------+-------------------+---------+-----------+-----------+---------+----------+----------+-----------+----------+
|PT381201021|2019-08-22 00:00:00|  Albumin| 2019-08-14|        4.3|        0|2019-08-22|2019-11-22|PT381201021|2019-08-22|
|PT381201021|2019-05-23 00:00:00|  Albumin|         NA|         NA|      -91|2019-05-23|2019-08-23|PT381201021|2019-08-22|
|PT381201021|2019-05-22 00:00:00|  Albumin|         NA|         NA|      -92|2019-05-22|2019-08-22|PT381201021|2019-08-22|
|PT381201021|2019-05-21 00:00:00|  Albumin|         NA|         NA|      -93|2019-05-21|2019-08-21|PT381201021|2019-08-22|
|PT381201021|2019-05-20 00:00:00|  Albumin|         NA|         NA|      -94|2019-05-20|2019-08-20|PT381201021|2019-08-22|
|PT381201021|2019-05-19 00:00:00|  Albumin|         NA|         NA|      -95|2019-05-19|2019-08-19|PT381201021|2019-08-22|
|PT381201021|2019-05-18 00:00:00|  Albumin|         NA|         NA|      -96|2019-05-18|2019-08-18|PT381201021|2019-08-22|
|PT381201021|2019-05-17 00:00:00|  Albumin|         NA|         NA|      -97|2019-05-17|2019-08-17|PT381201021|2019-08-22|
|PT381201021|2019-05-16 00:00:00|  Albumin|         NA|         NA|      -98|2019-05-16|2019-08-16|PT381201021|2019-08-22|
+-----------+-------------------+---------+-----------+-----------+---------+----------+----------+-----------+----------+
'''
Now that you have the date-difference flag, you can apply a filter and then do a join to get the expected result.


The code can be optimized further to run on large datasets.
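The remaining filter-and-join step described above can be sketched in plain Python for clarity (this is not the Spark code; `fill_from_latest` is a hypothetical helper, and the 91-day threshold matches the date_diff of -91 seen in the output above):

```python
from datetime import date

def fill_from_latest(rows, days=91):
    """Plain-Python sketch of the filter-then-join idea.

    rows: list of dicts with keys blast_date (datetime.date),
    result_date and test_result (strings, 'NA' when missing).
    Fills only the NA row closest to the latest blast_date,
    provided it lies within `days` of that date.
    """
    latest = max(rows, key=lambda r: r["blast_date"])
    # Candidate rows: missing a result and within the threshold window.
    window = [r for r in rows
              if r["result_date"] == "NA"
              and 0 < (latest["blast_date"] - r["blast_date"]).days <= days]
    if not window:
        return rows
    # Only the row closest to the latest blast_date gets filled.
    target = max(window, key=lambda r: r["blast_date"])
    return [dict(r, result_date=latest["result_date"],
                 test_result=latest["test_result"]) if r is target else r
            for r in rows]
```

On the sample data this fills the 2019-05-23 row (91 days before 2019-08-22) and leaves the earlier rows as NA, matching the expected output in the question.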

You should use a window function with rangeBetween over the blast_date cast to seconds:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window().partitionBy("ptid", "test_name") \
            .orderBy(F.to_timestamp("blast_date", "yyyy-MM-dd").cast("long")) \
            .rangeBetween(Window.currentRow, 86400 * 91)

df.withColumn("collect", F.collect_list(F.array("result_date", "test_result")).over(w))\
  .withColumn("collect", F.expr("""filter(collect, x -> array_contains(x, 'NA') != True)""")[0])\
  .withColumn("result_date", F.when((F.col("result_date") == 'NA') & (F.col("collect").isNotNull()), F.col("collect")[0]).otherwise(F.col("result_date")))\
  .withColumn("test_result", F.when((F.col("test_result") == 'NA') & (F.col("collect").isNotNull()), F.col("collect")[1]).otherwise(F.col("test_result")))\
  .drop("collect").show(truncate=False)

+-----------+----------+---------+-----------+-----------+---------+
|ptid       |blast_date|test_name|result_date|test_result|date_diff|
+-----------+----------+---------+-----------+-----------+---------+
|PT381201021|2019-05-16|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-17|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-18|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-19|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-20|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-21|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-22|Albumin  |NA         |NA         |0        |
|PT381201021|2019-05-23|Albumin  |2019-08-14 |4.3        |0        |
|PT381201021|2019-08-22|Albumin  |2019-08-14 |4.3        |8        |
+-----------+----------+---------+-----------+-----------+---------+

In your example, what if blast_date=2019-05-23 did not exist?
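With the 91-day range window above, the answer to this comment would be that nothing gets filled: the next-closest NA row (2019-05-22) is 92 days before 2019-08-22 and falls outside the window. A quick sanity check of the gaps (plain Python, not Spark):

```python
from datetime import date

latest = date(2019, 8, 22)  # blast_date of the row that has a result
# Days from each remaining NA row to the latest blast_date
gaps = {d: (latest - d).days for d in [date(2019, 5, 22), date(2019, 5, 21)]}
# None fall inside the 91-day range window, so no row would be filled
within = [d for d in gaps if gaps[d] <= 91]
```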