PySpark: applying a transformation to rows with the same ID number (column value)


Using PySpark, I wrote a function that, given a certain DF (a DF with only 1 row), produces exactly what I want. I now want to apply these transformations to the whole dataframe. The problem is that each row's Approval column packs several records into one value. I can use explode and extract the information I need, but some of the steps also rely on values from previous rows. I need to make sure that, whenever the function runs, all rows belonging to a given distinct ID number are processed together.

To give an idea:

from pyspark.sql import functions as F
from pyspark.sql.functions import col, explode, split, json_tuple, when, lit
from pyspark.sql.window import Window

def function(test):
    # Prepend a synthetic "REVISION" entry, then explode the '|^|'-delimited Approval history into rows
    test = test.withColumn("Approval", F.concat(F.lit('REVISION'), F.lit('|^|'), F.col('Approval')))
    test = test.withColumn('Approval', explode(split(col("Approval"), "\\|\\^\\|")))
    # Extract Date, SESA and Status from the '~'-separated Approval string
    test = test.withColumn("Date", split(col("Approval"), "~").getItem(6)).withColumn("SESA", split(col("Approval"), "~").getItem(2)).withColumn("Status", split(col("Approval"), "~").getItem(5))
    # Drop the synthetic REVISION row for non-latest versions, then deduplicate
    test = test.filter(~((col('Approval') == "REVISION") & (col('ids_last_version') == "false"))).orderBy(test.ids_last_version.desc()).dropDuplicates(['Approval', 'Date', 'SESA'])
    # Parse the Revisions JSON column
    test = test.select('*', json_tuple(test.Revisions, 'revisedDate', 'addeditems', 'changedItems', 'droppeditems').alias('revisedDate', 'addeditems', 'changedItems', 'droppeditems'))
    # Keep the revision fields only on the REVISION row, blank them out elsewhere
    test = test.withColumn("revisedDate", when(col("Approval") != "REVISION", "-").otherwise(col("revisedDate"))).withColumn("addeditems", when(col("Approval") != "REVISION", "-").otherwise(col("addeditems"))).withColumn("changedItems", when(col("Approval") != "REVISION", "-").otherwise(col("changedItems"))).withColumn("droppeditems", when(col("Approval") != "REVISION", "-").otherwise(col("droppeditems")))
    test = test.orderBy(test.Date.asc())
    # Global (unpartitioned) window -- only correct while the DF holds a single ID
    my_window = Window.partitionBy().orderBy("Date")
    test = test.orderBy("ID").withColumn("Last_Action", F.lag(test.Status).over(my_window)).withColumn("Last_Action_Date", F.lag(test.Date).over(my_window))
    # Collect the Date of an "Approved" row and broadcast it to every row -- again assumes one ID per DF
    last_submission_date = test.filter(col('Status') == "Approved").select("Date").collect()[0][0]
    test = test.withColumn('last_submission_date', lit(last_submission_date))
    return test
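
For reference, the two steps above that implicitly assume a single ID are the unpartitioned window and the collect() of the approved date. A minimal sketch (my own assumption, not the final code) of what a per-ID version of just those two steps might look like, partitioning the window by ID and replacing the collect() with a windowed aggregate:

    # Hypothetical per-ID variant of the last steps -- a sketch, not tested against the real data
    id_window = Window.partitionBy("ID").orderBy("Date")   # lag() now only looks back within one ID
    test = test.withColumn("Last_Action", F.lag(col("Status")).over(id_window)).withColumn("Last_Action_Date", F.lag(col("Date")).over(id_window))
    # Per-ID aggregate instead of collect(); F.max is used purely as an example --
    # whether last_submission_date should be the min, max or first approved Date depends on the intended semantics
    test = test.withColumn("last_submission_date", F.max(when(col("Status") == "Approved", col("Date"))).over(Window.partitionBy("ID")))

This would keep the whole pipeline as DataFrame operations, so Spark itself handles the per-ID grouping.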
Suppose I have a DF like this:

+-----+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------+
|ID   |Approval                                                                                     |Revisions                                                                                                  |ids_last_version|
+-----+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------+
|11111|A1 ~B1~C1~D1~E1~F1~2020-06-05 10:33:00.0|^|A2 ~B2~C2~D2~E2~F2~2020-06-05 10:44:09.0          |{"addeditems":{Ajson},"changedItems":{Bjson},"droppeditems":{Cjson},"revisedDate":"2020-06-06 10:23:58"}   |true            |
|22222|AA1~BB1~CC1~DD1~EE1~FF1~2019-12-20 22:14:23.0|^|AA2~BB2~CC2~DD2~EE2~FF2~2019-12-22 06:34:31.0|{"addeditems":{AAjson},"changedItems":{BBjson},"droppeditems":{CCjson},"revisedDate":"2020-06-06 10:23:58"}|true            |
+-----+---------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------+----------------+
Basically, as soon as I explode the data, the selections are no longer necessarily made within the same ID. Does any of this approach even make sense?
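
If the requirement is to guarantee that all rows belonging to one ID are always handled together, one pattern worth noting (assuming Spark 3.x; `exploded_df` and `per_id` are illustrative names, not from the code above) is a grouped-map function via groupBy("ID").applyInPandas, sketched roughly below:

    import pandas as pd

    # Hypothetical per-group function: receives every row of one ID as a single pandas DataFrame
    def per_id(pdf: pd.DataFrame) -> pd.DataFrame:
        pdf = pdf.sort_values("Date")
        pdf["Last_Action"] = pdf["Status"].shift(1)        # pandas equivalent of F.lag within one ID
        pdf["Last_Action_Date"] = pdf["Date"].shift(1)
        return pdf[["ID", "Approval", "Date", "Status", "Last_Action", "Last_Action_Date"]]

    # The schema must list every column returned by per_id
    result = exploded_df.groupBy("ID").applyInPandas(
        per_id,
        schema="ID string, Approval string, Date string, Status string, Last_Action string, Last_Action_Date string")

Each call to per_id then only ever sees the rows of one ID, so lag-style logic and the "last approved date" can be computed without rows from other IDs leaking in.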