Apache Spark: assigning an identifier to repeated sequences of rows

I have a DataFrame in which I need to generate a "CycleID" column, as shown below:

+-------+-------+----------+---------+
| type  | stage | Timestamp| CycleID |
+-------+-------+----------+---------+
| type1 | s1    | a        | 1       |
| type1 | s2    | b        | 1       |
| type1 | s2    | c        | 1       |
| type1 | s3    | d        | 1       |
| type1 | s1    | e        | 2       |
| type1 | s2    | f        | 2       |
| type1 | s3    | g        | 2       |
| type2 | s1    | a        | 1       |
| type2 | s2    | b        | 1       |
| type2 | s3    | c        | 1       |
+-------+-------+----------+---------+

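Data constraints

Each cycle of a type consists of three predefined stages in a fixed order.

Stages within a cycle can repeat, but they never occur out of order. For example, stage s1 never comes after stage s2 within a cycle.

Timestamps are guaranteed to increase between the rows of each stage, for example b > a.

The goal is a new column, "CycleID", that uniquely identifies each cycle within a type.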

What I have tried so far:

from pyspark.sql import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType

w = Window.partitionBy("type").orderBy("Timestamp")
# Numeric stage index, e.g. "s2" -> 2
inputdf = inputdf.withColumn("stagenum", func.expr("substring(stage, 2)").cast(IntegerType()))
# Difference to the previous row's stage number; null on the first row of each type
diff = col("stagenum") - func.lag("stagenum", 1).over(w)
inputdf = inputdf.withColumn("temp", func.when(diff.isNull() | (diff == 0) | (diff == 1), func.lit(1)).otherwise(func.lit(100)))

Beyond this, I tried several other approaches using lag(), but none of them seems to give a clean way to assign the CycleID.

Any help is appreciated.
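For reference, the constraints above can be turned into a small plain-Python rule, independent of Spark: because stages never go backwards inside a cycle, a drop in the stage number between consecutive rows of the same type should mark the start of a new cycle. The following is only a sketch under that assumption; the function name `assign_cycle_ids` and the tuple layout are made up for illustration.

```python
from itertools import groupby

def assign_cycle_ids(rows):
    """rows: iterable of (type, stage, timestamp) tuples.
    Returns the rows sorted by type and timestamp, with a cycle id appended."""
    out = []
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))  # by type, then timestamp
    for _, group in groupby(ordered, key=lambda r: r[0]):
        cycle, prev_num = 0, None
        for typ, stage, ts in group:
            num = int(stage[1:])  # "s2" -> 2, same idea as substring(stage, 2)
            # First row of the type, or the stage number dropped: a new cycle begins
            if prev_num is None or num < prev_num:
                cycle += 1
            prev_num = num
            out.append((typ, stage, ts, cycle))
    return out

rows = [
    ("type1", "s1", "a"), ("type1", "s2", "b"), ("type1", "s2", "c"), ("type1", "s3", "d"),
    ("type1", "s1", "e"), ("type1", "s2", "f"), ("type1", "s3", "g"),
    ("type2", "s1", "a"), ("type2", "s2", "b"), ("type2", "s3", "c"),
]
for row in assign_cycle_ids(rows):
    print(row)
```

On the sample table this reproduces the expected CycleID column, which makes it a convenient check for any Spark implementation on small inputs.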

Data

# The CycleID column is only a placeholder here; it is overwritten below
rows = [
    ('type1', 's1', 'a', 1), ('type1', 's2', 'b', 1), ('type1', 's1', 'a', 1),
    ('type1', 's2', 'b', 1), ('type1', 's2', 'c', 1), ('type1', 's3', 'd', 1),
    ('type1', 's1', 'e', 1), ('type1', 's2', 'f', 1), ('type1', 's3', 'g', 1),
]
df = spark.createDataFrame(rows, ['type', 'stage', 'Timestamp', 'CycleID'])
df.show()

Solution

from pyspark.sql.window import Window
import pyspark.sql.functions as F
from pyspark.sql.functions import col, lag

# Frame from the first row of the type up to the current row, in Timestamp order,
# so that sum() over it is a cumulative sum
w = Window.partitionBy('type').orderBy('Timestamp').rowsBetween(Window.unboundedPreceding, Window.currentRow)

df = (
    df.withColumn('CycleID', col('stage') == 's1')  # Boolean flag: True on s1 rows
      .withColumn('CycleID', F.sum(col('CycleID').cast('integer')).over(w))  # cast to integer, then cumulative sum
)
df.show()


+-----+-----+---------+-------+
| type|stage|Timestamp|CycleID|
+-----+-----+---------+-------+
|type1|   s1|        a|      1|
|type1|   s2|        b|      1|
|type1|   s2|        c|      1|
|type1|   s3|        d|      1|
|type1|   s1|        e|      2|
|type1|   s2|        f|      2|
|type1|   s3|        g|      2|
+-----+-----+---------+-------+
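To see why a cumulative sum of the s1 flags yields cycle numbers, here is the same idea in plain Python on the seven stage values shown in the table above. This is an illustrative sketch only; it uses `itertools.accumulate` as a stand-in for the windowed sum.

```python
from itertools import accumulate

# Stage values of the type1 rows from the table above, in Timestamp order
stages = ["s1", "s2", "s2", "s3", "s1", "s2", "s3"]

flags = [int(s == "s1") for s in stages]   # 1 where a row is s1, else 0
cycle_ids = list(accumulate(flags))        # running total of the flags

print(flags)
print(cycle_ids)
```

Each s1 row raises the running total by one, and every following row inherits that total until the next s1 appears.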

Following your comment:

Sort in descending order and take the Boolean on s3 instead. The code is as follows:

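# Cumulative sum in descending Timestamp order, counting s3 rows
w_desc = Window.partitionBy('type').orderBy(col('Timestamp').desc()).rowsBetween(Window.unboundedPreceding, Window.currentRow)
(
    df.withColumn('CycleID', col('stage') == 's3')
      .withColumn('CycleID', F.sum(col('CycleID').cast('integer')).over(w_desc))
      .show()
)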

+-----+-----+---------+-------+
| type|stage|Timestamp|CycleID|
+-----+-----+---------+-------+
|type1|   s3|        g|      1|
|type1|   s2|        f|      1|
|type1|   s1|        e|      1|
|type1|   s3|        d|      2|
|type1|   s2|        c|      2|
|type1|   s2|        b|      2|
|type1|   s2|        b|      2|
|type1|   s1|        a|      2|
|type1|   s1|        a|      2|
+-----+-----+---------+-------+

If there may be multiple s3 rows, use lag, as follows:

m = Window.partitionBy('type').orderBy('Timestamp')
# Previous row's stage within the same type (null on the first row)
df1 = df.select('*', lag('stage').over(m).alias('prev_stage'))
# A cycle starts at an s1 row that is either the first row or directly follows an s3 row
starts = (col('stage') == 's1') & (col('prev_stage').isNull() | (col('prev_stage') == 's3'))
(
    df1.withColumn('CycleID', F.sum(starts.cast('integer')).over(m.rowsBetween(Window.unboundedPreceding, Window.currentRow)))
       .drop('prev_stage')
       .show()
)
+-----+-----+---------+-------+
| type|stage|Timestamp|CycleID|
+-----+-----+---------+-------+
|type1|   s1|        a|      1|
|type1|   s1|        a|      1|
|type1|   s2|        b|      1|
|type1|   s2|        b|      1|
|type1|   s2|        c|      1|
|type1|   s3|        d|      1|
|type1|   s1|        e|      2|
|type1|   s2|        f|      2|
|type1|   s3|        g|      2|
+-----+-----+---------+-------+

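Does this help?

Thanks. It works for the given dataset, but it fails if there are multiple "s1" rows in the same cycle.

What if you use s3 instead? That is, sort in descending order and proceed. The code is here:

df.sort(col('Timestamp').desc()).withColumn('CycleID', (col('stage')=='s3')).withColumn('CycleID', F.sum(F.col('CycleID').cast('integer')).over(Window.partitionBy().orderBy().rowsBetween(-sys.maxsize, 0))).show()

Thanks. The only way to identify a new cycle is to look at the first instance of s1; apart from that, every stage can repeat. I first used lag() to find the first occurrence of s1 and then used your solution for the cumulative sum.

Could you upvote my answer?

There is a similar question here. Maybe the answers help.

Following my answer to that question, I would use a window partitioned by the "type" column and then a custom window function that iterates over the rows of each type and applies your business logic.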
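The approach described in the comments (use lag() to find the first s1 of each cycle, then take a cumulative sum) can be sketched in plain Python as follows. This is an illustration of the logic, not the asker's actual code: the helper name `cycle_ids_for_type` is hypothetical, and the input is assumed to hold the stages of a single type, already sorted by Timestamp.

```python
from itertools import accumulate

def cycle_ids_for_type(stages):
    """stages: list of stage labels for one type, in Timestamp order."""
    prev = [None] + stages[:-1]  # plain-Python stand-in for lag("stage")
    # A cycle starts at an s1 row whose previous row is missing or is not s1,
    # so repeated s1 rows inside a cycle are not counted again
    starts = [int(cur == "s1" and p != "s1") for cur, p in zip(stages, prev)]
    return list(accumulate(starts))

# Example with repeated s1, s2 and s3 rows inside the first cycle
stages = ["s1", "s1", "s2", "s2", "s2", "s3", "s3", "s1", "s2", "s3"]
print(cycle_ids_for_type(stages))
```

In PySpark the same condition would compare `stage` with `lag("stage")` over a window partitioned by type and ordered by Timestamp, followed by a running sum over that window.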