Apache Spark: extract the first occurrence of a string after a substring in a Spark dataframe - PySpark
I have a pyspark dataframe (df) with a timestamp column and a message column (data type str), as shown below:
Sample dataframe
message time_stamp
some irrelevant text 2015-01-23 08:27:18
irrelevant_text start : string 2015-01-23 08:27:34
contributor Id :XYZ_ABCD 2015-01-23 08:27:54
some irrelevant text 2015-01-23 08:28:36
contributor Id :XYZ_ABCD 2015-01-23 08:28:55
some irrelevant text 2015-01-23 08:29:36
contributor Id :MNOP_xyz 2015-01-23 08:29:45
some irrelevant text 2015-01-23 08:29:30
irrelevant_text end : string 2015-01-23 08:30:47
some irrelevant text 2015-01-23 08:30:59
irrelevant_text start : string 2015-01-23 08:31:34
contributor Id :EFG_A 2015-01-23 08:31:54
some irrelevant text 2015-01-23 08:32:05
contributor Id :pqr_wx 2015-01-23 08:32:15
some irrelevant text 2015-01-23 08:32:26
contributor Id :pqr_wx 2015-01-23 08:33:01
some irrelevant text 2015-01-23 08:33:09
irrelevant_text end : string 2015-01-23 08:40:34
some irrelevant text 2015-01-23 08:40:47
irrelevant_text start : string 2015-01-23 09:31:34
contributor Id :lmo_uvw 2015-01-23 09:31:54
some irrelevant text 2015-01-23 09:32:05
contributor Id :xlr_mot 2015-01-23 09:32:15
some irrelevant text 2015-01-23 09:32:26
irrelevant_text end : string 2015-01-23 09:40:34
some irrelevant text 2015-01-23 09:40:47
I want to extract the first occurrence of the string that follows contributor Id : between each start : string and end : string pair, including contributor Id : values that appear only once, and discard any Id occurrences that are not the first within that pair. There can be multiple such start/end instances in one day.
Expected output:
time_stamp ID
2015-01-23 08:27:54 XYZ_ABCD
2015-01-23 08:29:45 MNOP_xyz
2015-01-23 08:31:54 EFG_A
2015-01-23 08:32:15 pqr_wx
2015-01-23 09:31:54 lmo_uvw
2015-01-23 09:32:15 xlr_mot
Any help on this would be greatly appreciated. Thanks.

Assign a row number within each partition of ID and start/end timestamps, and keep the rows with row number 1:
from pyspark.sql import functions as F, Window

df2 = df.withColumn(
    # Forward-fill: the most recent "start" timestamp at or before this row
    'begin',
    F.last(
        F.when(F.col('message').rlike('start'), F.col('time_stamp')), True
    ).over(Window.orderBy('time_stamp'))
).withColumn(
    # Backward-fill: the next "end" timestamp at or after this row
    'end',
    F.first(
        F.when(F.col('message').rlike('end'), F.col('time_stamp')), True
    ).over(Window.orderBy('time_stamp').rowsBetween(0, Window.unboundedFollowing))
).withColumn(
    'ID',
    F.regexp_extract('message', r'contributor Id :(\S+)', 1)
).filter(
    "ID != '' and begin is not null and end is not null"
).withColumn(
    # First occurrence of each ID within a (begin, end) session
    'rn',
    F.row_number().over(Window.partitionBy('ID', 'begin', 'end').orderBy('time_stamp'))
).filter(
    'rn = 1'
).select(
    'time_stamp', 'ID'
).orderBy('time_stamp')
df2.show()
+-------------------+--------+
| time_stamp| ID|
+-------------------+--------+
|2015-01-23 08:27:54|XYZ_ABCD|
|2015-01-23 08:29:45|MNOP_xyz|
|2015-01-23 08:31:54| EFG_A|
|2015-01-23 08:32:15| pqr_wx|
|2015-01-23 09:31:54| lmo_uvw|
|2015-01-23 09:32:15| xlr_mot|
+-------------------+--------+
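To make the window semantics concrete without needing a Spark session, here is a minimal pure-Python sketch of the same three steps on a subset of the sample data: forward-fill the latest "start" timestamp, backward-fill the next "end" timestamp, then keep the first row per (ID, begin, end) group. This is an illustration of the logic only, not part of the Spark answer.

```python
import re

# (message, time_stamp) pairs, already sorted by time_stamp
rows = [
    ("irrelevant_text start : string", "08:27:34"),
    ("contributor Id :XYZ_ABCD",       "08:27:54"),
    ("contributor Id :XYZ_ABCD",       "08:28:55"),
    ("contributor Id :MNOP_xyz",       "08:29:45"),
    ("irrelevant_text end : string",   "08:30:47"),
    ("irrelevant_text start : string", "08:31:34"),
    ("contributor Id :EFG_A",          "08:31:54"),
    ("contributor Id :pqr_wx",         "08:32:15"),
    ("contributor Id :pqr_wx",         "08:33:01"),
    ("irrelevant_text end : string",   "08:40:34"),
]

# Forward-fill: last "start" timestamp at or before each row
# (what F.last(..., True) over an unbounded-preceding window computes).
begin, begins = None, []
for msg, ts in rows:
    if "start" in msg:
        begin = ts
    begins.append(begin)

# Backward-fill: first "end" timestamp at or after each row
# (what F.first(..., True) over rowsBetween(0, unboundedFollowing) computes).
end, ends = None, []
for msg, ts in reversed(rows):
    if "end" in msg:
        end = ts
    ends.append(end)
ends.reverse()

# Keep the first occurrence of each ID within its (begin, end) session
# (the row_number() == 1 filter).
seen, result = set(), []
for (msg, ts), b, e in zip(rows, begins, ends):
    m = re.search(r"contributor Id :(\S+)", msg)
    if m and b and e and (m.group(1), b, e) not in seen:
        seen.add((m.group(1), b, e))
        result.append((ts, m.group(1)))

print(result)
# [('08:27:54', 'XYZ_ABCD'), ('08:29:45', 'MNOP_xyz'),
#  ('08:31:54', 'EFG_A'), ('08:32:15', 'pqr_wx')]
```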
Thank you for your answer. I'm getting an empty result dataframe; I think it's because of the IDs. Could we chat privately? Thanks.