Apache Spark analytic functions with specific values
I have data as follows:
arrayData = [
('abc','PN1','SN1','2021-02-03 10:20:11','','Fail'),
('abc','PN1','SN1','2021-02-03 10:20:15','','Fail'),
('abc','PN1','SN1','2021-02-03 10:20:19','','Fail'),
('abc','PN1','SN1','2021-02-03 10:21:11','2021-02-03 10:21:19','Success'),
('abc','PN1','SN1','2021-02-03 10:22:19','','Fail'),
('abc','PN1','SN1','2021-02-03 10:22:29','','Fail'),
('abc','PN1','SN1','2021-02-03 10:22:39','','Fail'),
('abc','PN1','SN1','2021-02-03 10:22:49','','Fail'),
('abc','PN1','SN1','2021-02-03 10:22:59','','Fail'),
('abc','PN1','SN1','2021-02-03 10:31:11','2021-02-03 10:31:19','Success'),
('abc','PN1','SN1','2021-02-03 10:31:21','2021-02-03 10:32:19','Success'),
('abc','PN1','SN1','2021-02-03 11:32:49','','Fail'),
('abc','PN1','SN1','2021-02-03 11:34:59','','Fail'),
('abc','PN1','SN2','2021-02-03 10:22:49','','Fail'),
('abc','PN1','SN2','2021-02-03 10:22:59','','Fail')
]
root
|-- event: string (nullable = true)
|-- PN: string (nullable = true)
|-- SN: string (nullable = true)
|-- Claim_Start: string (nullable = true)
|-- Claim_End: string (nullable = true)
|-- Status: string (nullable = true)
+-----+---+---+-------------------+-------------------+-------+
|event| PN| SN| Claim_Start| Claim_End| Status|
+-----+---+---+-------------------+-------------------+-------+
| abc|PN1|SN1|2021-02-03 10:20:11| | Fail|
| abc|PN1|SN1|2021-02-03 10:20:15| | Fail|
| abc|PN1|SN1|2021-02-03 10:20:19| | Fail|
| abc|PN1|SN1|2021-02-03 10:21:11|2021-02-03 10:21:19|Success|
| abc|PN1|SN1|2021-02-03 10:22:19| | Fail|
| abc|PN1|SN1|2021-02-03 10:22:29| | Fail|
| abc|PN1|SN1|2021-02-03 10:22:39| | Fail|
| abc|PN1|SN1|2021-02-03 10:22:49| | Fail|
| abc|PN1|SN1|2021-02-03 10:22:59| | Fail|
| abc|PN1|SN1|2021-02-03 10:31:11|2021-02-03 10:31:19|Success|
| abc|PN1|SN1|2021-02-03 10:31:21|2021-02-03 10:32:19|Success|
| abc|PN1|SN1|2021-02-03 11:32:49| | Fail|
| abc|PN1|SN1|2021-02-03 11:34:59| | Fail|
| abc|PN1|SN2|2021-02-03 10:22:49| | Fail|
| abc|PN1|SN2|2021-02-03 10:22:59| | Fail|
+-----+---+---+-------------------+-------------------+-------+
I applied the following transformations (note that to_timestamp yields null for the empty Claim_End strings):

from pyspark.sql import functions as f

df2 = df.withColumn("event_start_time", f.to_timestamp(df.Claim_Start, 'yyyy-MM-dd HH:mm:ss')) \
        .withColumn("event_end_time", f.to_timestamp(df.Claim_End, 'yyyy-MM-dd HH:mm:ss'))
df2 = df2.drop("Claim_Start", "Claim_End")
+-----+---+---+-------+-------------------+-------------------+
|event| PN| SN| Status| event_start_time| event_end_time|
+-----+---+---+-------+-------------------+-------------------+
| abc|PN1|SN1| Fail|2021-02-03 10:20:11| null|
| abc|PN1|SN1| Fail|2021-02-03 10:20:15| null|
| abc|PN1|SN1| Fail|2021-02-03 10:20:19| null|
| abc|PN1|SN1|Success|2021-02-03 10:21:11|2021-02-03 10:21:19|
| abc|PN1|SN1| Fail|2021-02-03 10:22:19| null|
| abc|PN1|SN1| Fail|2021-02-03 10:22:29| null|
| abc|PN1|SN1| Fail|2021-02-03 10:22:39| null|
| abc|PN1|SN1| Fail|2021-02-03 10:22:49| null|
| abc|PN1|SN1| Fail|2021-02-03 10:22:59| null|
| abc|PN1|SN1|Success|2021-02-03 10:31:11|2021-02-03 10:31:19|
| abc|PN1|SN1|Success|2021-02-03 10:31:21|2021-02-03 10:32:19|
| abc|PN1|SN1| Fail|2021-02-03 11:32:49| null|
| abc|PN1|SN1| Fail|2021-02-03 11:34:59| null|
| abc|PN1|SN2| Fail|2021-02-03 10:22:49| null|
| abc|PN1|SN2| Fail|2021-02-03 10:22:59| null|
+-----+---+---+-------+-------------------+-------------------+
The output I need:
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
| PN| SN|event| status| event_start_time| event_end_time| first_try|num_attempts|
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
|PN1|SN1| abc| Fail|2021-02-03 11:32:49| |2021-02-03 11:32:49| 2|
|PN1|SN1| abc|Success|2021-02-03 10:21:11|2021-02-03 10:21:19|2021-02-03 10:20:11| 4|
|PN1|SN1| abc|Success|2021-02-03 10:31:11|2021-02-03 10:31:19|2021-02-03 10:22:29| 6|
|PN1|SN1| abc|Success|2021-02-03 10:31:21|2021-02-03 10:32:19| null| 1|
|PN1|SN2| abc| Fail|2021-02-03 10:22:49| |2021-02-03 10:22:49| 2|
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
Output logic:

For SN1, the first Success occurs on the fourth record, i.e. on the fourth attempt, and the first try of that group was at '2021-02-03 10:20:11'. For trailing failures with no subsequent success, we keep the end date empty and just count the attempts.

Is there any way to do this with analytic functions, grouping rows up to and including each Success, i.e. so that a new group starts after every Success?
Any help is greatly appreciated.

Basically, we need a window function and a lag over the Status column to detect where each group ends. Grouping by that and applying the logic you outlined produces the result with the following code:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

w = Window.partitionBy('PN', 'SN').orderBy('Claim_Start')

df.withColumn('lagged', F.lag('Status').over(w)) \
  .withColumn('status_flag', F.when(F.col('lagged') == 'Success', 1).otherwise(0)) \
  .withColumn('group', F.sum('status_flag').over(w)) \
  .groupBy('PN', 'SN', 'event', 'group') \
  .agg(F.last('Status').alias('status'),
       F.when(F.last('Status') == 'Success', F.last('Claim_Start'))
        .otherwise(F.first('Claim_Start')).alias('event_start_time'),
       F.last('Claim_End').alias('event_end_time'),
       F.when(F.count('Claim_Start') > 1, F.first('Claim_Start'))
        .otherwise(None).alias('first_try'),
       F.count('Claim_Start').alias('num_attempts')) \
  .drop('group') \
  .orderBy('PN', 'SN', 'event_end_time') \
  .show()
which results in:
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
| PN| SN|event| status| event_start_time| event_end_time| first_try|num_attempts|
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
|PN1|SN1| abc| Fail|2021-02-03 11:32:49| |2021-02-03 11:32:49| 2|
|PN1|SN1| abc|Success|2021-02-03 10:21:11|2021-02-03 10:21:19|2021-02-03 10:20:11| 4|
|PN1|SN1| abc|Success|2021-02-03 10:31:11|2021-02-03 10:31:19|2021-02-03 10:22:19| 6|
|PN1|SN1| abc|Success|2021-02-03 10:31:21|2021-02-03 10:32:19| null| 1|
|PN1|SN2| abc| Fail|2021-02-03 10:22:49| |2021-02-03 10:22:49| 2|
+---+---+-----+-------+-------------------+-------------------+-------------------+------------+
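The lag-plus-running-sum pattern above is a common sessionization trick: flag the row that comes right after each Success, and the running sum of those flags becomes the group id. A plain-Python sketch of the same logic, for intuition (the function name is mine, just for illustration):

```python
def split_after_success(statuses):
    """Assign a group id that increments on the row after each 'Success',
    mirroring F.lag('Status') + a running F.sum over the window."""
    groups = []
    group = 0
    prev = None
    for s in statuses:
        if prev == "Success":   # lag(Status) == 'Success' -> flag = 1
            group += 1          # running sum of flags = group id
        groups.append(group)
        prev = s
    return groups

# The 13 SN1 rows split into 4 groups, matching the 4 SN1 output rows:
sn1 = ["Fail", "Fail", "Fail", "Success",
       "Fail", "Fail", "Fail", "Fail", "Fail", "Success",
       "Success",
       "Fail", "Fail"]
print(split_after_success(sn1))  # -> [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 2, 3, 3]
```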