Python: Remove a regex match from a string column of a PySpark DataFrame
I need to remove text matching a regular expression from a string column in a PySpark DataFrame.
df= spark.createDataFrame([("Dog 10H03", "10H03"), ("Cat 09H24 eats rat", "09H24"), ("Mouse 09H45 runs away", "09H45"), ("Mouse 09H45 enters room", "09H45")],["Animal", "Time"])
The timestamp (e.g. 10H03) is the pattern that must be removed. Expected result:
+--------------------+------------------+-----+
|              Animal| Animal_strip_time| Time|
+--------------------+------------------+-----+
|           Dog 10H03|              Dog |10H03|
|  Cat 09H24 eats rat|     Cat  eats rat|09H24|
|Mouse 09H45 runs ...|  Mouse  runs away|09H45|
|Mouse 09H45 enter...|Mouse  enters room|09H45|
+--------------------+------------------+-----+
The timestamp in the Time column may differ from the one in the Animal column, so the Time column cannot be used to match the string.
The regex should follow the pattern XXHXX, where each X is a digit from 0 to 9.

This should do the job:
from pyspark.sql import functions as F
# Replace every XXHXX timestamp in Animal with an empty string (raw string keeps \d intact).
df = df.withColumn('Animal_strip_time', F.regexp_replace('Animal', r'\d\dH\d\d', ''))
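To sanity-check the pattern without a Spark session, here is a minimal sketch using Python's built-in re module on the sample strings from the question. The helper name strip_time is hypothetical and not part of PySpark. Spark's regexp_replace uses Java regular expressions, but for a simple pattern like this over ASCII digits the two engines should behave the same.

```python
import re

# Same pattern as in the answer: two digits, a literal "H", two digits.
TIMESTAMP = re.compile(r"[0-9][0-9]H[0-9][0-9]")


def strip_time(text: str) -> str:
    """Remove every XXHXX timestamp from text (hypothetical helper)."""
    return TIMESTAMP.sub("", text)


animals = [
    "Dog 10H03",
    "Cat 09H24 eats rat",
    "Mouse 09H45 runs away",
    "Mouse 09H45 enters room",
]

# Only the timestamp is removed, so the spaces around it are left in place.
stripped = [strip_time(a) for a in animals]
for before, after in zip(animals, stripped):
    print(repr(before), "->", repr(after))
```

Because only the timestamp itself is removed, the surrounding spaces stay, which is why the stripped values contain a trailing or doubled space.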
If Time may differ and cannot be used, why include it in the question?