Python PySpark regexp_replace with list elements not replacing the string

I am trying to replace strings in a dataframe column using regexp_replace. I need to apply a regex pattern to every record in the dataframe column, but the strings are not being replaced as expected.

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark import sql
from  pyspark.sql.functions import regexp_replace,col
import re

conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)


df=sc.parallelize([('2345','ADVANCED by John'),
('2398','ADVANCED by ADVANCE'),
('2328','Verified by somerandomtext'),
('3983','Double Checked by Marsha')]).toDF(['ID', "Notes"])

reg_patterns=["ADVANCED|ADVANCE/ADV/","ASSOCS|AS|ASSOCIATES/ASSOC/"]

for i in range(len(reg_patterns)):
        res_split=re.findall(r"[^/]+",reg_patterns[i])
        res_split[0]
        df=df.withColumn('NotesUPD',regexp_replace(col('Notes'),res_split[0],res_split[1]))

df.show()
Output:

+----+--------------------+--------------------+
|  ID|               Notes|            NotesUPD|
+----+--------------------+--------------------+
|2345|    ADVANCED by John|    ADVANCED by John|
|2398| ADVANCED by ADVANCE| ADVANCED by ADVANCE|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+

Expected Output:

+----+--------------------+--------------------+
|  ID|               Notes|            NotesUPD|
+----+--------------------+--------------------+
|2345|    ADVANCED by John|         ADV by John|
|2398| ADVANCED by ADVANCE|          ADV by ADV|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+
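
For reference, each entry in reg_patterns packs a search pattern and its replacement into a single string separated by "/". A minimal, Spark-free sketch of how the re.findall call above splits them apart (my own illustration, not part of the original post):

import re

reg_patterns = ["ADVANCED|ADVANCE/ADV/", "ASSOCS|AS|ASSOCIATES/ASSOC/"]

for pattern in reg_patterns:
    # r"[^/]+" grabs the runs of non-slash characters: the alternation and the replacement
    search, replacement = re.findall(r"[^/]+", pattern)
    print(search, "->", replacement)
# ADVANCED|ADVANCE -> ADV
# ASSOCS|AS|ASSOCIATES -> ASSOC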

You should write a udf function and loop through reg_patterns inside it, as follows:

reg_patterns=["ADVANCED|ADVANCE/ADV/","ASSOCS|AS|ASSOCIATES/ASSOC/"]

import re
from pyspark.sql import functions as f
from pyspark.sql import types as t
def replaceUdf(column):
    # Split each pattern string into its alternatives and its replacement,
    # then apply a plain substring replacement for every alternative.
    for i in range(len(reg_patterns)):
        res_split = re.findall(r"[^/]+", reg_patterns[i])
        for x in res_split[0].split("|"):
            column = column.replace(x, res_split[1])
    return column

reg_replaceUdf = f.udf(replaceUdf, t.StringType())

df = df.withColumn('NotesUPD', reg_replaceUdf(f.col('Notes')))
df.show()
You should get:

+----+--------------------+--------------------+
|  ID|               Notes|            NotesUPD|
+----+--------------------+--------------------+
|2345|    ADVANCED by John|         ADV by John|
|2398| ADVANCED by ADVANCE|          ADV by ADV|
|2328|Verified by somer...|Verified by somer...|
|3983|Double Checked by...|Double Checked by...|
+----+--------------------+--------------------+

The problem is that your code repeatedly overwrites the previous result, starting over from the original column on every iteration. Instead, you should build on the previous result:

notes_upd = col('Notes')

for i in range(len(reg_patterns)):
    res_split = re.findall(r"[^/]+", reg_patterns[i])
    # Chain each replacement onto the previous expression rather than onto col('Notes')
    notes_upd = regexp_replace(notes_upd, res_split[0], res_split[1])
and you will get the desired result:

df.withColumn('NotesUPD', notes_upd).show()

# +----+--------------------+--------------------+
# |  ID|               Notes|            NotesUPD|
# +----+--------------------+--------------------+
# |2345|    ADVANCED by John|         ADV by John|
# |2398| ADVANCED by ADVANCE|          ADV by ADV|
# |2328|Verified by somer...|Verified by somer...|
# |3983|Double Checked by...|Double Checked by...|
# +----+--------------------+--------------------+
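
For what it's worth, the same chaining over reg_patterns can also be written as a fold; a minimal sketch of that variant (my own illustration, assuming re, col, regexp_replace and reg_patterns are already defined as above):

from functools import reduce

# Fold every pattern/replacement pair into a single column expression
notes_upd = reduce(
    lambda c, p: regexp_replace(c, *re.findall(r"[^/]+", p)),
    reg_patterns,
    col('Notes'))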

The previous solutions are limited to reg_patterns lists with only a few entries. When the normalization patterns have many entries (for example, spelling correction with a custom dictionary), the implementation below scales well.

First, map the reg_patterns list into a dictionary:

import re
from pyspark.sql.functions import col, udf

def parse_string(s, or_delim, target_delim):
    # e.g. "ADVANCED|ADVANCE/ADV/" -> {"ADVANCED": "ADV", "ADVANCE": "ADV"}
    keys, value = s.rstrip(target_delim).rsplit(target_delim)
    return {key: value for key in keys.split(or_delim)}

reg_patterns=["ADVANCED|ADVANCE/ADV/","ASSOCS|AS|ASSOCIATES/ASSOC/"]

normalization_dict = {}
for item in reg_patterns:
    normalization_dict.update(parse_string(item, "|", "/"))
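As a quick check (the values follow from the reg_patterns above), the resulting dictionary maps every alternative to its normalized form:

print(normalization_dict)
# {'ADVANCED': 'ADV', 'ADVANCE': 'ADV',
#  'ASSOCS': 'ASSOC', 'AS': 'ASSOC', 'ASSOCIATES': 'ASSOC'}
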
Normalization of the DataFrame's "Notes" column is then done with a curried function, as follows:

def my_norm_func(s, ngram_dict, pattern):
    # Replace every match of the compiled pattern with its dictionary value
    return pattern.sub(lambda x: ngram_dict[x.group()], s)

# One alternation over all dictionary keys, anchored on word boundaries
norm_pattern = re.compile(r'\b(' + '|'.join([re.escape(item)
                          for item in normalization_dict.keys()]) + r')\b')
my_norm_udf = udf(lambda s: my_norm_func(s, normalization_dict, norm_pattern))
df = df.withColumn("NotesUPD", my_norm_udf(col("Notes")))
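
As a quick, Spark-free sanity check of the helper on a plain string (my own illustration):

print(my_norm_func("ADVANCED by ADVANCE", normalization_dict, norm_pattern))
# ADV by ADV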
and the DataFrame now shows the expected result:

df.show()

# +----+--------------------+--------------------+
# |  ID|               Notes|            NotesUPD|
# +----+--------------------+--------------------+
# |2345|    ADVANCED by John|         ADV by John|
# |2398| ADVANCED by ADVANCE|          ADV by ADV|
# |2328|Verified by somer...|Verified by somer...|
# |3983|Double Checked by...|Double Checked by...|
# +----+--------------------+--------------------+

Glad it was helpful.
@marjun, this approach is superior to using a udf. There is a simple mistake in your original code, and this answer fixes it.