PySpark lag function returns null


I have a DataFrame like this:

>>> df.show()
+----------------------+------------------------+--------------------+
|date_cast             |id                      |         status    |
+----------------------+------------------------+--------------------+
|            2021-02-20|    123...              |open                |
|            2021-02-21|    123...              |open                |
|            2021-02-17|    123...              |closed              |
|            2021-02-22|    123...              |open                |
|            2021-02-19|    123...              |open                |
|            2021-02-18|    123...              |closed              |
+----------------------+------------------------+--------------------+
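I have been trying to apply a very simple lag to it to see what the status was on the previous day, but I keep getting null. The date was a string, so I cast it, thinking the problem might be that the dates were not being ordered correctly. I also hard-coded the window in my over/partitionBy clause, but I still get null: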

df_lag = df.withColumn('lag_status',F.lag(df['status']) \
                                 .over(Window.partitionBy("date_cast").orderBy(F.asc('date_cast')))).show()
Can anyone help with the following problem? This version returns null as well:

>>> column_list = ["date_cast","id"]
>>> win_spec = Window.partitionBy([F.col(x) for x in column_list]).orderBy(F.asc('date_cast'))
>>> df.withColumn('lag_status', F.lag('status').over(
...     win_spec
...     )
... )

+----------------------+------------------------+--------------------+-----------+
|date_cast             |id                      |         status     |lag_status |
+----------------------+------------------------+--------------------+-----------+
|            2021-02-19|    123...              |open                |       null|
|            2021-02-21|    123...              |open                |       null|
|            2021-02-17|    123...              |open                |       null|
|            2021-02-18|    123...              |open                |       null|
|            2021-02-22|    123...              |open                |       null|
|            2021-02-20|    123...              |open                |       null|
+----------------------+------------------------+--------------------+-----------+

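This happens because you partitioned the data by date_cast, and date_cast has unique values, so every partition contains a single row and lag has no previous row to return. Partition by "id" instead of date_cast, for example: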

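from pyspark.sql import Window, functions as F
df_lag = df.withColumn('lag_status', F.lag(df['status']) \
                                 .over(Window.partitionBy("id").orderBy(F.asc('date_cast'))))
df_lag.show()  # show() returns None, so call it separately to keep df_lag a DataFrame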
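To see why the choice of partition column matters, here is a minimal pure-Python sketch that mimics what lag does inside each window partition. The helper `lag_over` is hypothetical and is not part of PySpark; it only illustrates the semantics. The sample rows reuse the dates and statuses from the question, with the id kept in its truncated form.

```python
from collections import defaultdict

def lag_over(rows, partition_key, order_key, value_key):
    """Hypothetical helper: emulate lag(value_key, 1) over
    (partition by partition_key order by order_key)."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_key]].append(row)

    result = []
    for part in partitions.values():
        part.sort(key=lambda r: r[order_key])
        previous = None  # the first row of a partition has no predecessor
        for row in part:
            result.append({**row, "lag_status": previous})
            previous = row[value_key]
    return result

# Rows taken from the question; "123..." is the id as shown there (truncated).
rows = [
    {"date_cast": "2021-02-20", "id": "123...", "status": "open"},
    {"date_cast": "2021-02-21", "id": "123...", "status": "open"},
    {"date_cast": "2021-02-17", "id": "123...", "status": "closed"},
    {"date_cast": "2021-02-22", "id": "123...", "status": "open"},
    {"date_cast": "2021-02-19", "id": "123...", "status": "open"},
    {"date_cast": "2021-02-18", "id": "123...", "status": "closed"},
]

# Partitioning by a unique column: every partition holds one row.
by_date = lag_over(rows, "date_cast", "date_cast", "status")

# Partitioning by id: all six rows share one partition, ordered by date.
by_id = lag_over(rows, "id", "date_cast", "status")

for row in by_id:
    print(row["date_cast"], row["status"], row["lag_status"])
```

With the date as the partition key, each row is alone in its partition and gets no previous value; with the id as the partition key, only the earliest date lacks one.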