PySpark regexp_replace with a list of patterns is not replacing strings

I am trying to use regexp_replace to replace strings in a DataFrame column. I have to apply the regex patterns to all records in the column, but the strings are not being replaced as expected.
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark import sql
from pyspark.sql.functions import regexp_replace,col
import re
conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)
df = sc.parallelize([('2345', 'ADVANCED by John'),
                     ('2398', 'ADVANCED by ADVANCE'),
                     ('2328', 'Verified by somerandomtext'),
                     ('3983', 'Double Checked by Marsha')]).toDF(['ID', 'Notes'])

reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

for i in range(len(reg_patterns)):
    res_split = re.findall(r"[^/]+", reg_patterns[i])
    df = df.withColumn('NotesUPD', regexp_replace(col('Notes'), res_split[0], res_split[1]))
df.show()
Output:
+----+--------------------+--------------------+
| ID| Notes| NotesUPD|
+----+--------------------+--------------------+
|2345| ADVANCED by John| ADVANCED by John|
|2398| ADVANCED by ADVANCE| ADVANCED by ADVANCE|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
Expected Output:
+----+--------------------+--------------------+
| ID| Notes| NotesUPD|
+----+--------------------+--------------------+
|2345| ADVANCED by John| ADV by John|
|2398| ADVANCED by ADVANCE| ADV by ADV |
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
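For reference, each entry in reg_patterns packs the search alternatives and the replacement into one string, separated by "/". The parsing step used above can be checked in plain Python, without a Spark session:

```python
import re

reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

# re.findall(r"[^/]+", ...) splits each entry on "/" into
# [search_regex, replacement]
for pattern in reg_patterns:
    res_split = re.findall(r"[^/]+", pattern)
    print(res_split)
# ['ADVANCED|ADVANCE', 'ADV']
# ['ASSOCS|AS|ASSOCIATES', 'ASSOC']
```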
You should write a udf function and loop through reg_patterns inside it, as below:
reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

import re
from pyspark.sql import functions as f
from pyspark.sql import types as t

def replaceUdf(column):
    for i in range(len(reg_patterns)):
        res_split = re.findall(r"[^/]+", reg_patterns[i])
        for x in res_split[0].split("|"):
            column = column.replace(x, res_split[1])
    return column

reg_replaceUdf = f.udf(replaceUdf, t.StringType())

df = df.withColumn('NotesUPD', reg_replaceUdf(f.col('Notes')))
df.show()
You should get:
+----+--------------------+--------------------+
| ID| Notes| NotesUPD|
+----+--------------------+--------------------+
|2345| ADVANCED by John| ADV by John|
|2398| ADVANCED by ADVANCE| ADV by ADV|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
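Because the UDF body is plain Python string manipulation, its logic can be verified locally without a Spark session (same replaceUdf as above):

```python
import re

reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

def replaceUdf(column):
    # For each pattern, replace every "|"-separated alternative
    # with the replacement that follows the "/"
    for i in range(len(reg_patterns)):
        res_split = re.findall(r"[^/]+", reg_patterns[i])
        for x in res_split[0].split("|"):
            column = column.replace(x, res_split[1])
    return column

print(replaceUdf("ADVANCED by ADVANCE"))        # ADV by ADV
print(replaceUdf("Verified by somerandomtext")) # unchanged
```

One caveat: str.replace works on raw substrings, not whole words, so a short key such as "AS" would also be rewritten inside longer words that happen to contain it.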
The problem is that your code repeatedly overwrites the previous result by starting from the original column each time. Instead, you should build on the previous result:
notes_upd = col('Notes')
for i in range(len(reg_patterns)):
    res_split = re.findall(r"[^/]+", reg_patterns[i])
    notes_upd = regexp_replace(notes_upd, res_split[0], res_split[1])
and you will get the desired result:
df.withColumn('NotesUPD', notes_upd).show()
# +----+--------------------+--------------------+
# | ID| Notes| NotesUPD|
# +----+--------------------+--------------------+
# |2345| ADVANCED by John| ADV by John|
# |2398| ADVANCED by ADVANCE| ADV by ADV|
# |2328|Verified by somer...|Verified by somer...|
# |3983|Double Checked by...|Double Checked by...|
# +----+--------------------+--------------------+
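Since both Spark's regexp_replace and Python's re.sub treat a string like "ADVANCED|ADVANCE" as regex alternation, the chaining idea above can be sanity-checked in plain Python, with each step feeding its output into the next pattern:

```python
import re

reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

def chain_replacements(s):
    # Mirror the chained regexp_replace calls with re.sub:
    # each iteration rewrites the result of the previous one.
    for pattern in reg_patterns:
        search, repl = re.findall(r"[^/]+", pattern)
        s = re.sub(search, repl, s)
    return s

print(chain_replacements("ADVANCED by ADVANCE"))  # ADV by ADV
```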
The previous solutions are limited to short reg_patterns lists. The implementation below scales well when the normalization patterns have many entries (e.g., for spelling correction with a custom dictionary).
First, map the reg_patterns list into a dictionary:
import re
from pyspark.sql.functions import col, udf

def parse_string(s, or_delim, target_delim):
    # "ADVANCED|ADVANCE/ADV/" -> keys "ADVANCED|ADVANCE", value "ADV"
    keys, value = s.rstrip(target_delim).rsplit(target_delim)
    return {key: value for key in keys.split(or_delim)}

reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

normalization_dict = {}
for item in reg_patterns:
    normalization_dict.update(parse_string(item, "|", "/"))
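With the two entries above, parse_string fans each "|"-separated alternative out to the same replacement; a quick standalone check of what one entry expands to:

```python
def parse_string(s, or_delim, target_delim):
    # Split "KEY1|KEY2/VALUE/" into its keys and the shared value
    keys, value = s.rstrip(target_delim).rsplit(target_delim)
    return {key: value for key in keys.split(or_delim)}

d = parse_string("ADVANCED|ADVANCE/ADV/", "|", "/")
print(d)  # {'ADVANCED': 'ADV', 'ADVANCE': 'ADV'}
```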
Normalization of the DataFrame "Notes" column is then done with a curried function, as follows:
def my_norm_func(s, ngram_dict, pattern):
    # Replace every match with its dictionary value
    return pattern.sub(lambda x: ngram_dict[x.group()], s)

norm_pattern = re.compile(r'\b(' + '|'.join(re.escape(item)
                          for item in normalization_dict.keys()) + r')\b')

my_norm_udf = udf(lambda s: my_norm_func(s, normalization_dict, norm_pattern))

df = df.withColumn("NotesUPD", my_norm_udf(col("Notes")))
which produces the expected result:
df.show()
# +----+--------------------+--------------------+
# | ID| Notes| NotesUPD|
# +----+--------------------+--------------------+
# |2345| ADVANCED by John| ADV by John|
# |2398| ADVANCED by ADVANCE| ADV by ADV|
# |2328|Verified by somer...|Verified by somer...|
# |3983|Double Checked by...|Double Checked by...|
# +----+--------------------+--------------------+
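One advantage of the compiled pattern over plain str.replace is the \b word boundaries: a short key like "AS" only matches as a whole word. A local sketch of the substitution (same logic as my_norm_func, no Spark session needed):

```python
import re

normalization_dict = {'ADVANCED': 'ADV', 'ADVANCE': 'ADV',
                      'ASSOCS': 'ASSOC', 'AS': 'ASSOC', 'ASSOCIATES': 'ASSOC'}

# \b anchors ensure only whole words are replaced
norm_pattern = re.compile(r'\b(' + '|'.join(re.escape(k) for k in normalization_dict) + r')\b')

def normalize(s):
    # Replace each whole-word match with its dictionary value
    return norm_pattern.sub(lambda m: normalization_dict[m.group()], s)

print(normalize("ADVANCED by ADVANCE"))  # ADV by ADV
print(normalize("MASTER CLASS"))         # unchanged: "AS" is not a whole word here
```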
Comments: Glad it was helpful. @marjun, this approach is better than using a udf; there was a simple mistake in your original code, which this answer fixes.