Regex with PySpark: a regular expression between two characters captures text and numbers, but not dates
Using PySpark's regexp_extract() I can get the substring between two characters in a string. It grabs the text and the numbers, but it does not grab the date.
data = [('2345', '<Date>1999/12/12 10:00:05</Date>'),
('2398', '<Crew>crewIdXYZ</Crew>'),
('2328', '<Latitude>0.8252644369443788</Latitude>'),
('3983', '<Longitude>-2.1915840465066916<Longitude>')]
df = sc.parallelize(data).toDF(['ID', 'values'])
df.show(truncate=False)
+----+-----------------------------------------+
|ID  |values                                   |
+----+-----------------------------------------+
|2345|<Date>1999/12/12 10:00:05</Date>         |
|2398|<Crew>crewIdXYZ</Crew>                   |
|2328|<Latitude>0.8252644369443788</Latitude>  |
|3983|<Longitude>-2.1915840465066916<Longitude>|
+----+-----------------------------------------+
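from pyspark.sql.functions import col, regexp_extract

# Group 2 is meant to hold the text between '>' and '<'
df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^<:]+(?=:?<))', 2))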
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID  |values                                   |vals               |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date>         |                   |
|2398|<Crew>crewIdXYZ</Crew>                   |crewIdXYZ          |
|2328|<Latitude>0.8252644369443788</Latitude>  |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
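The empty result for the Date row can be reproduced without a Spark cluster. The sketch below uses Python's re module as a stand-in for regexp_extract. That is an assumption: Spark evaluates Java regular expressions, but this pattern only uses syntax both engines share. The helper name extract is hypothetical and only mimics regexp_extract returning an empty string when nothing matches.

```python
import re

# Pattern copied from the question; group 2 is the extracted value.
PATTERN = re.compile(r'(.)((?<=>)[^<:]+(?=:?<))')

samples = [
    '<Date>1999/12/12 10:00:05</Date>',
    '<Crew>crewIdXYZ</Crew>',
    '<Latitude>0.8252644369443788</Latitude>',
    '<Longitude>-2.1915840465066916<Longitude>',
]


def extract(value):
    """Hypothetical stand-in for regexp_extract(..., 2): '' when there is no match."""
    match = PATTERN.search(value)
    return match.group(2) if match else ''


results = [extract(s) for s in samples]
for sample, result in zip(samples, results):
    print(repr(sample), '->', repr(result))
```

The character class [^<:] excludes colons, so on the Date row the match cannot run past "10". The lookahead (?=:?<) then requires a "<" directly after an optional colon, which never occurs inside the time value, so the whole pattern fails for that row.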
@jxc Thanks. Here is how it works:
df_2 = df.withColumn('vals', regexp_extract(col('values'), '(.)((?<=>)[^>]+(?=:?<))', 2))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID  |values                                   |vals               |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date>         |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew>                   |crewIdXYZ          |
|2328|<Latitude>0.8252644369443788</Latitude>  |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
You can use
>([^<>]+)<
Or just remove the tags: df.withColumn('vals', regexp_replace('values', '<[^>]*>', ''))
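The tag-removal idea can be tried locally with Python's re.sub standing in for regexp_replace. Two assumptions apply: the tag pattern <[^>]*> is a reconstruction, since the pattern is truncated in the text above, and Python's regex engine is used here in place of Spark's Java one. The helper name strip_tags is hypothetical.

```python
import re

# Assumed tag pattern: '<', any run of characters other than '>', then '>'.
TAG = re.compile(r'<[^>]*>')

samples = [
    '<Date>1999/12/12 10:00:05</Date>',
    '<Crew>crewIdXYZ</Crew>',
    '<Latitude>0.8252644369443788</Latitude>',
    '<Longitude>-2.1915840465066916<Longitude>',
]


def strip_tags(value):
    """Hypothetical stand-in for regexp_replace(value, TAG, '')."""
    return TAG.sub('', value)


stripped = [strip_tags(s) for s in samples]
for sample, result in zip(samples, stripped):
    print(repr(sample), '->', repr(result))
```

Unlike extraction, this removes every tag in the string, so a value holding several elements would come back as their text run together.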
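You have obfuscated the pattern a lot: (.)(?<=>) is equivalent to >, and [^>]+ is greedy, so it matches up to the next > or the end of the string and then backtracks to the last <.
Justin, ideally this should not be an answer, but a comment on @Wiktor's answer or a reply to the comments on the question.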
df_2 = df.withColumn('vals', regexp_extract(col('values'), '>([^<>]+)<', 1))
df_2.show(truncate=False)
+----+-----------------------------------------+-------------------+
|ID  |values                                   |vals               |
+----+-----------------------------------------+-------------------+
|2345|<Date>1999/12/12 10:00:05</Date>         |1999/12/12 10:00:05|
|2398|<Crew>crewIdXYZ</Crew>                   |crewIdXYZ          |
|2328|<Latitude>0.8252644369443788</Latitude>  |0.8252644369443788 |
|3983|<Longitude>-2.1915840465066916<Longitude>|-2.1915840465066916|
+----+-----------------------------------------+-------------------+
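As a quick check that the lookaround pattern and the plain capture-group pattern extract the same values on these rows, both can be run side by side. This is a minimal sketch that assumes Python's re module behaves like Spark's Java regex engine for these two patterns. The helper name first_group is hypothetical.

```python
import re

# Lookaround version from the thread (value in group 2).
LOOKAROUND = re.compile(r'(.)((?<=>)[^>]+(?=:?<))')
# Simpler version: literal '>' and '<' around one capture group (value in group 1).
SIMPLE = re.compile(r'>([^<>]+)<')

samples = [
    '<Date>1999/12/12 10:00:05</Date>',
    '<Crew>crewIdXYZ</Crew>',
    '<Latitude>0.8252644369443788</Latitude>',
    '<Longitude>-2.1915840465066916<Longitude>',
]


def first_group(pattern, value, group):
    """Hypothetical helper: return the requested group, or '' when there is no match."""
    match = pattern.search(value)
    return match.group(group) if match else ''


with_lookaround = [first_group(LOOKAROUND, s, 2) for s in samples]
with_simple = [first_group(SIMPLE, s, 1) for s in samples]
print(with_lookaround == with_simple)
```

The simpler pattern needs no backtracking across the closing tag, because [^<>]+ stops at the first "<" it meets.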