Python: Removing a regex pattern from a string column in a PySpark DataFrame


I need to remove the text matching a regular expression from a string column in a PySpark DataFrame.

df = spark.createDataFrame(
    [("Dog 10H03", "10H03"),
     ("Cat 09H24 eats rat", "09H24"),
     ("Mouse 09H45 runs away", "09H45"),
     ("Mouse 09H45 enters room", "09H45")],
    ["Animal", "Time"])

The timestamp (for example 10H03) is the pattern that has to be removed. Expected result:

+--------------------+------------------+-----+
|              Animal| Animal_strip_time| Time|
+--------------------+------------------+-----+
|           Dog 10H03|              Dog |10H03|
|  Cat 09H24 eats rat|     Cat  eats rat|09H24|
|Mouse 09H45 runs ...|  Mouse  runs away|09H45|
|Mouse 09H45 enter...|Mouse  enters room|09H45|
+--------------------+------------------+-----+
The timestamp in the Time column may differ from the one in the Animal column, so the Time column cannot be used to match the string.


The regular expression should follow the pattern XXHXX, where each X is a digit from 0 to 9. This should do the job:

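from pyspark.sql import functions as F

# Raw string so that \d reaches the regex engine unchanged:
# two digits, a literal 'H', two digits.
df = df.withColumn('Animal_strip_time', F.regexp_replace('Animal', r'\d\dH\d\d', ''))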
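
To check the pattern without starting a Spark session, the same regular expression can be tried in plain Python. This is only a sketch: it assumes that Python's `re` module and Spark's Java-based regex engine treat this simple pattern the same way, and the helper names `strip_time` and `strip_time_tidy` are hypothetical, not part of the question or answer.

```python
import re

# Same pattern as in the answer: two digits, a literal 'H', two digits.
TIME_PATTERN = re.compile(r"\d\dH\d\d")


def strip_time(text: str) -> str:
    """Remove every XXHXX timestamp from text (hypothetical helper)."""
    return TIME_PATTERN.sub("", text)


def strip_time_tidy(text: str) -> str:
    """Optional variant: also collapse the whitespace left behind."""
    return re.sub(r"\s+", " ", strip_time(text)).strip()


samples = ["Dog 10H03", "Cat 09H24 eats rat", "Mouse 09H45 runs away"]
for s in samples:
    # repr() makes trailing and doubled spaces visible
    print(repr(strip_time(s)), repr(strip_time_tidy(s)))
```

As in the expected output above, removing the timestamp alone leaves a double or trailing space. If that matters, a second `F.regexp_replace` on whitespace followed by `F.trim` would be one way to mirror the tidy variant in Spark.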

If the Time column may differ and cannot be used, why include it in the question?