Regex to replace multiple occurrences of a string in a Spark DataFrame column using Scala


scala apache-spark


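I have a column in which a particular string occurs multiple times. The number of occurrences is not fixed; the string can be repeated any number of times.

For example, the description column contains the following data:

The account account has been cancelled for the account account account and with the account

Here I basically want to replace each run of consecutive occurrences of "account" with a single "account".

Expected output:

The account has been cancelled for the account and with the account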

You can use a regex pattern together with regexp_replace to replace repeated words:

import org.apache.spark.sql.functions.{col, regexp_replace}

val df = spark.sql("select 'The account account has been cancelled for the account account account and with the account' col")

df.show(false)
+-------------------------------------------------------------------------------------------+
|col                                                                                        |
+-------------------------------------------------------------------------------------------+
|The account account has been cancelled for the account account account and with the account|
+-------------------------------------------------------------------------------------------+

val df2 = df.withColumn("col", regexp_replace(col("col"), "\\b(\\w+)(\\b\\W+\\b\\1\\b)*", "$1"))

df2.show(false)
+-------------------------------------------------------------------+
|col                                                                |
+-------------------------------------------------------------------+
|The account has been cancelled for the account and with the account|
+-------------------------------------------------------------------+
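
Spark's regexp_replace uses Java regular expressions, so the pattern can be tried on plain strings without a Spark session. The following is a minimal sketch under that assumption. The object DedupWords and the helpers collapseAny and collapseWord are hypothetical names introduced here for illustration and are not part of Spark. collapseAny applies the pattern from the answer, which collapses any repeated word, while collapseWord is a narrower variant that only touches runs of one given word, which may be closer to what the question asks for.

```scala
import java.util.regex.Pattern

// Hypothetical helper object, for illustration only (not a Spark API).
object DedupWords {
  // Pattern from the answer: capture a word, then consume any number of
  // immediate repeats of that same word (\1), separated by non-word characters.
  val anyWord: String = "\\b(\\w+)(\\b\\W+\\b\\1\\b)*"

  // Collapse every run of a repeated word down to a single occurrence.
  def collapseAny(text: String): String =
    text.replaceAll(anyWord, "$1")

  // Collapse runs of one specific word only; other repeated words are left alone.
  // Pattern.quote guards against regex metacharacters in `word`.
  def collapseWord(text: String, word: String): String = {
    val w = Pattern.quote(word)
    text.replaceAll("\\b(" + w + ")(\\W+" + w + "\\b)+", "$1")
  }
}

val input =
  "The account account has been cancelled for the account account account and with the account"

println(DedupWords.collapseAny(input))
println(DedupWords.collapseWord("big big dog and the account account", "account"))
```

If the narrower behaviour is wanted in the DataFrame, the same pattern string built in collapseWord could be passed as the second argument of regexp_replace in place of the general one.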