Apache Spark: extract the first occurrence of a string after a substring in a Spark dataframe - PySpark

I have a PySpark dataframe (df) with a timestamp column and a message column (data type str), as shown below:

Sample dataframe:

 message                                                   time_stamp
some irrelevant text                                    2015-01-23 08:27:18
irrelevant_text start  : string                         2015-01-23 08:27:34
contributor Id :XYZ_ABCD                                2015-01-23 08:27:54
some irrelevant text                                    2015-01-23 08:28:36
contributor Id :XYZ_ABCD                                2015-01-23 08:28:55
some irrelevant text                                    2015-01-23 08:29:36
contributor Id :MNOP_xyz                                2015-01-23 08:29:45
some irrelevant text                                    2015-01-23 08:29:30
irrelevant_text end : string                            2015-01-23 08:30:47
some irrelevant text                                    2015-01-23 08:30:59
irrelevant_text start  : string                         2015-01-23 08:31:34
contributor Id :EFG_A                                   2015-01-23 08:31:54
some irrelevant text                                    2015-01-23 08:32:05
contributor Id :pqr_wx                                  2015-01-23 08:32:15
some irrelevant text                                    2015-01-23 08:32:26
contributor Id :pqr_wx                                  2015-01-23 08:33:01
some irrelevant text                                    2015-01-23 08:33:09
irrelevant_text end : string                            2015-01-23 08:40:34
some irrelevant text                                    2015-01-23 08:40:47
irrelevant_text start  : string                         2015-01-23 09:31:34
contributor Id :lmo_uvw                                 2015-01-23 09:31:54
some irrelevant text                                    2015-01-23 09:32:05
contributor Id :xlr_mot                                 2015-01-23 09:32:15
some irrelevant text                                    2015-01-23 09:32:26
irrelevant_text end : string                            2015-01-23 09:40:34
some irrelevant text                                    2015-01-23 09:40:47
Between each "start : string" and "end : string" pair, I want to extract the first occurrence of the string that follows "contributor Id :", along with those contributor Ids that appear only once, and discard any occurrence that is not the first. There can be multiple such start/end blocks in a single day.

Expected output:

time_stamp                ID
2015-01-23 08:27:54     XYZ_ABCD
2015-01-23 08:29:45     MNOP_xyz
2015-01-23 08:31:54     EFG_A
2015-01-23 08:32:15     pqr_wx
2015-01-23 09:31:54     lmo_uvw
2015-01-23 09:32:15     xlr_mot

Any help on this would be much appreciated. Thanks.
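For reference, a minimal sketch that reproduces the first start/end block of the sample dataframe above (the names df, message and time_stamp come from the question; the timestamps are kept as strings, which sort correctly here because of the fixed yyyy-MM-dd HH:mm:ss format):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partial reproduction of the sample data; extend the list for the full log.
df = spark.createDataFrame(
    [
        ('some irrelevant text', '2015-01-23 08:27:18'),
        ('irrelevant_text start  : string', '2015-01-23 08:27:34'),
        ('contributor Id :XYZ_ABCD', '2015-01-23 08:27:54'),
        ('some irrelevant text', '2015-01-23 08:28:36'),
        ('contributor Id :XYZ_ABCD', '2015-01-23 08:28:55'),
        ('contributor Id :MNOP_xyz', '2015-01-23 08:29:45'),
        ('irrelevant_text end : string', '2015-01-23 08:30:47'),
    ],
    ['message', 'time_stamp'],
)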

Assign a row number over each partition of the ID and the begin/end timestamps, and filter for the rows with row number 1:

from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    # carry forward the timestamp of the most recent 'start' line
    'begin',
    F.last(
        F.when(F.col('message').rlike('start'), F.col('time_stamp')), True
    ).over(Window.orderBy('time_stamp'))
).withColumn(
    # look ahead to the timestamp of the next 'end' line
    'end',
    F.first(
        F.when(F.col('message').rlike('end'), F.col('time_stamp')), True
    ).over(Window.orderBy('time_stamp').rowsBetween(0, Window.unboundedFollowing))
).withColumn(
    # pull out the ID that follows 'contributor Id :'
    'ID',
    F.regexp_extract('message', r'contributor Id :(\S+)', 1)
).filter(
    # keep only contributor rows that fall inside a start/end block
    "ID != '' and begin is not null and end is not null"
).withColumn(
    # number each ID's occurrences within its start/end block
    'rn',
    F.row_number().over(Window.partitionBy('ID', 'begin', 'end').orderBy('time_stamp'))
).filter(
    'rn = 1'
).select(
    'time_stamp', 'ID'
).orderBy('time_stamp')

df2.show()
+-------------------+--------+
|         time_stamp|      ID|
+-------------------+--------+
|2015-01-23 08:27:54|XYZ_ABCD|
|2015-01-23 08:29:45|MNOP_xyz|
|2015-01-23 08:31:54|   EFG_A|
|2015-01-23 08:32:15|  pqr_wx|
|2015-01-23 09:31:54| lmo_uvw|
|2015-01-23 09:32:15| xlr_mot|
+-------------------+--------+
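Note that the begin and end windows use Window.orderBy without a partitionBy, so Spark pulls all rows into a single partition to evaluate them (it logs a WindowExec warning about this). That is acceptable for a log of this size; for larger data you would want to partition the windows, for example by day.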

Thanks for your answer. I'm getting a blank result dataframe; I think it's because of the IDs. Can we chat privately? Thanks.
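If the result comes back empty, the most likely cause is that the regexp_extract pattern does not match the real messages (for example, different spacing or casing around the colon). A minimal debugging sketch, reusing the column names above; the looser pattern here is an assumption about the real log format, so adjust it as needed:

from pyspark.sql import functions as F

# Looser, case-insensitive pattern that tolerates whitespace around the colon.
# This pattern is a guess at the real log format, not part of the answer above.
df_debug = df.withColumn(
    'ID',
    F.regexp_extract('message', r'(?i)contributor id\s*:\s*(\S+)', 1)
)
df_debug.filter("ID != ''").show(truncate=False)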