pyspark: removing a substring of one column's value when the substring (another column's value) contains regex characters
Suppose I have a DataFrame like
df = spark.createDataFrame(
    [
        ('Test1 This is a test Test2', 'This is a test'),
        ('That is', 'That')
    ],
    ['text', 'name'])
+--------------------------+--------------+
|text                      |name          |
+--------------------------+--------------+
|Test1 This is a test Test2|This is a test|
|That is                   |That          |
+--------------------------+--------------+
If I apply df.withColumn("new", F.expr("regexp_replace(text,name,'')")).show(truncate=False), I get:
+--------------------------+--------------+------------+
|text                      |name          |new         |
+--------------------------+--------------+------------+
|Test1 This is a test Test2|This is a test|Test1  Test2|
|That is                   |That          | is         |
+--------------------------+--------------+------------+
Now suppose I have the following DataFrame:
+-----------------------------+-----------------+
|text                         |name             |
+-----------------------------+-----------------+
|Test1 This is a test(+1 Test2|This is a test(+1|
|That is                      |That             |
+-----------------------------+-----------------+
If I apply the command above, I get the following error message:

java.util.regex.PatternSyntaxException: Dangling meta character '+'

What can I do, in the most "pyspark" way possible, so that this exception does not occur, while keeping the values in text unchanged?
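The same failure can be reproduced outside Spark with Python's `re` module, since the name value is being compiled as a regular expression (Java's engine reports "Dangling meta character"; Python raises a similar `re.error`):

```python
import re

# "+" is a regex quantifier; inside "(+1" there is nothing for it to
# repeat, so compiling the name column's value as a pattern fails.
pattern = "This is a test(+1"
try:
    re.sub(pattern, "", "Test1 This is a test(+1 Test2")
except re.error as e:
    print("invalid pattern:", e)
```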
Thanks.

Instead of regexp_replace, use Spark's replace function:

replace(str, search[, replace]) - Replaces all occurrences of search with replace.
Example:
df.show(10,False)
#+-----------------------------+-----------------+
#|text                         |name             |
#+-----------------------------+-----------------+
#|Test1 This is a test(+1 Test2|This is a test(+1|
#|That is                      |That             |
#+-----------------------------+-----------------+
df.withColumn("new",expr("replace(text,name,'')")).show(10,False)
#+-----------------------------+-----------------+------------+
#|text                         |name             |new         |
#+-----------------------------+-----------------+------------+
#|Test1 This is a test(+1 Test2|This is a test(+1|Test1  Test2|
#|That is                      |That             | is         |
#+-----------------------------+-----------------+------------+
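If you do want to keep a regex-based replace, the underlying fix is to escape the regex metacharacters in the search string before it is treated as a pattern. A minimal pure-Python sketch of that idea, using the standard library's re.escape (shown outside Spark; in Spark you would have to build the escaped pattern yourself, e.g. in a UDF, since regexp_replace always interprets its second argument as a regex):

```python
import re

rows = [
    ("Test1 This is a test(+1 Test2", "This is a test(+1"),
    ("That is", "That"),
]

for text, name in rows:
    # re.escape quotes every regex metacharacter, so "(" and "+"
    # are matched literally instead of raising a syntax error.
    new = re.sub(re.escape(name), "", text)
    print(repr(new))  # 'Test1  Test2', then ' is'
```

Note the double space in 'Test1  Test2': only the matched substring is removed, and the surrounding spaces from text remain, which is the same behavior as the replace-based answer above.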