How to flag the last row in a window using PySpark


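My goal is to create a new column is_end: when a row is among the last ones and the preceding p_uuid is null (isNull()), then is_end = 1, otherwise is_end = 0. I don't know how to combine the when() and last() functions.

I have tried several times to combine them with a Window, but I always get an error :(

My dataframe: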

+---+------+------+----------+
|idx|u_uuid|p_uuid| timestamp|
+---+------+------+----------+
|  1|   110|  null|2019-09-28|
|  2|   110|  null|2019-09-28|
|  3|   110|   aaa|2019-09-28|
|  4|   110|  null|2019-09-17|
|  5|   110|  null|2019-09-17|
|  6|   110|   bbb|2019-09-17|
|  7|   110|  null|2019-09-01|
|  8|   110|  null|2019-09-01|
|  9|   110|  null|2019-09-01|
| 10|   110|  null|2019-09-01|
| 11|   110|   ccc|2019-09-01|
| 12|   110|  null|2019-09-01|
| 13|   110|  null|2019-09-01|
| 14|   110|  null|2019-09-01|
+---+------+------+----------+

w = Window.partitionBy("u_uuid").orderBy(col("timestamp"))
df.withColumn("p_uuid", when( lag(F.col("p_uuid").isNull()).over(w), 1).otherwise(0))

What I'm looking for:

+---+------+------+----------+------+
|idx|u_uuid|p_uuid| timestamp|is_end|
+---+------+------+----------+------+
|  1|   110|  null|2019-09-28|     0|
|  2|   110|  null|2019-09-28|     0|
|  3|   110|   aaa|2019-09-28|     0|
|  4|   110|  null|2019-09-17|     0|
|  5|   110|  null|2019-09-17|     0|
|  6|   110|   bbb|2019-09-17|     0|
|  7|   110|  null|2019-09-01|     0|
|  8|   110|  null|2019-09-01|     0|
|  9|   110|  null|2019-09-01|     0|
| 10|   110|  null|2019-09-01|     0|
| 11|   110|   ccc|2019-09-01|     0|
| 12|   110|  null|2019-08-29|     1|
| 13|   110|  null|2019-08-29|     1|
| 14|   110|  null|2019-08-29|     1|
+---+------+------+----------+------+

Here is a PySpark SQL approach for your case:

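from pyspark.sql import Window
import pyspark.sql.functions as F

# Frame covering the whole u_uuid partition, ordered by timestamp
w = (Window
    .partitionBy("u_uuid")
    .orderBy("timestamp")
    .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))

# is_end = 1 when the current p_uuid is null and the last non-null p_uuid in the frame is null
df.withColumn("is_end", F.when(F.last("p_uuid", True).over(w).isNull() & F.col("p_uuid").isNull(), F.lit(1)).otherwise(F.lit(0)))\
    .show()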

Thanks, but I get the error "when() takes 2 positional arguments but 3 were given". Does it work for you? For me, every row in the is_end column is 0.