Regex with PySpark: a regular expression between two characters captures text and numbers, but not dates


With PySpark, regexp_extract() can pull out the substring between two characters in a string. My pattern grabs the text and the numbers, but not the date.

data = [('2345', '<Date>1999/12/12 10:00:05</Date>'),
('2398', '<Crew>crewIdXYZ</Crew>'),
('2328', '<Latitude>0.8252644369443788</Latitude>'),        
('3983', '<Longitude>-2.1915840465066916<Longitude>')]

df = sc.parallelize(data).toDF(['ID', 'values'])

df.show(truncate=False)

+----+-----------------------------------------+
|ID  |values                                   |
+----+-----------------------------------------+
|2345|<Date>1999/12/12 10:00:05</Date>         |
|2398|<Crew>crewIdXYZ</Crew>                   |
|2328|<Latitude>0.8252644369443788</Latitude>  |
|3983|<Longitude>-2.1915840465066916<Longitude>|
+----+-----------------------------------------+
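from pyspark.sql.functions import col, regexp_extract

# Group 2 is the text between '>' and '<'; [^<:]+ cannot cross a ':' inside the value
df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^<:]+(?=:?<))', 2))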
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID  |values                                   |vals               |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date>         |                   |
|2398|<Crew>crewIdXYZ</Crew>                   |crewIdXYZ          |
|2328|<Latitude>0.8252644369443788</Latitude>  |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
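
To see why only the date row comes back empty, the pattern can be tried outside Spark. This is a minimal sketch using Python's built-in `re` module rather than Spark's Java regex engine; the two are assumed to behave the same for this particular pattern. The helper `extract` is hypothetical and only mimics the way `regexp_extract` returns an empty string when nothing matches.

```python
import re

# Pattern from the question; group 2 is the text between '>' and '<'.
PATTERN = re.compile(r'(.)((?<=>)[^<:]+(?=:?<))')

def extract(value, group=2):
    """Hypothetical stand-in for regexp_extract: '' when there is no match."""
    m = PATTERN.search(value)
    return m.group(group) if m else ''

rows = [
    '<Date>1999/12/12 10:00:05</Date>',
    '<Crew>crewIdXYZ</Crew>',
    '<Latitude>0.8252644369443788</Latitude>',
    '<Longitude>-2.1915840465066916<Longitude>',
]

results = [extract(r) for r in rows]
for row, val in zip(rows, results):
    print(repr(val), '<-', row)
```

The time part `10:00:05` contains colons, which `[^<:]+` cannot cross, so the lookahead `(?=:?<)` never reaches a `<` and the date row yields an empty string.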

@jxc Thanks. Here is how it works:

df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^>]+(?=:?<))', 2))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID  |values                                   |vals               |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date>         |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew>                   |crewIdXYZ          |
|2328|<Latitude>0.8252644369443788</Latitude>  |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+

You can use

>([^<>]+)<

Or just remove the tags:
df.withColumn('vals', regexp_replace('values', '<[^<>]*>', ''))
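
The tag-removal idea can be illustrated with Python's `re.sub` as a stand-in for `regexp_replace`. This is only a sketch: the tag pattern `<[^<>]*>` is an assumption about what a whole-tag match looks like here, and `strip_tags` is a hypothetical helper name.

```python
import re

# Assumed tag pattern: '<', any run of non-bracket characters, then '>'.
TAG = re.compile(r'<[^<>]*>')

def strip_tags(value):
    """Remove every <...> tag, keeping only the text between them."""
    return TAG.sub('', value)

rows = [
    '<Date>1999/12/12 10:00:05</Date>',
    '<Crew>crewIdXYZ</Crew>',
    '<Latitude>0.8252644369443788</Latitude>',
    '<Longitude>-2.1915840465066916<Longitude>',
]

stripped = [strip_tags(r) for r in rows]
for row, val in zip(rows, stripped):
    print(repr(val), '<-', row)
```

Unlike extraction, replacement leaves a value with no tags unchanged, and it would concatenate the text of several elements if a single value contained more than one.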
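You have overcomplicated the pattern quite a bit:
(.)(?<=>)
is equivalent to
>
and
[^>]+
is greedy: it runs up to the next > or the end of the string, and then backtracks to the last < so that the lookahead (?=:?<) can match.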
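
The simplification can be checked empirically. The sketch below uses Python's `re` module (assumed to agree with Spark's Java regex engine for these patterns) to compare the lookaround-based pattern with a plain capture group on the sample values. The helper `first_group` is hypothetical.

```python
import re

VERBOSE = re.compile(r'(.)((?<=>)[^>]+(?=:?<))')  # value is in group 2
SIMPLE = re.compile(r'>([^<>]+)<')                # value is in group 1

def first_group(pattern, value, group):
    """Return the requested group of the first match, or '' if none."""
    m = pattern.search(value)
    return m.group(group) if m else ''

rows = [
    '<Date>1999/12/12 10:00:05</Date>',
    '<Crew>crewIdXYZ</Crew>',
    '<Latitude>0.8252644369443788</Latitude>',
    '<Longitude>-2.1915840465066916<Longitude>',
]

verbose_vals = [first_group(VERBOSE, r, 2) for r in rows]
simple_vals = [first_group(SIMPLE, r, 1) for r in rows]
print(verbose_vals == simple_vals)
```

On this data both patterns return the same values; the plain capture group gets there without lookarounds or backtracking over the closing tag.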
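
Justin, ideally this should not be posted as an answer. It belongs as a comment under @Wiktor's answer, or as a reply in the comments on the question.
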
df_2 = df.withColumn('vals', regexp_extract(col('values'), '>([^<>]+)<', 1))
df_2.show(truncate=False)

+----+-----------------------------------------+-------------------+
|ID  |values                                   |vals               |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date>         |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew>                   |crewIdXYZ          |
|2328|<Latitude>0.8252644369443788</Latitude>  |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+