Adding single quotes to DataFrame column values
Tags: dataframe, apache-spark, pyspark, databricks
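A DataFrame contains a column QUALIFY with values like the following:
QUALIFY
=================
ColA|ColB|ColC
ColA
ColZ|ColP
The values in this column are separated by "|". I would like the values in this column to look like 'ColA','ColB','ColC'…
With the code below I can replace each | with ','. How can I also add a single quote at the beginning and end of the value?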
newDf = df_qualify.withColumn('QUALIFY2', regexp_replace('QUALIFY', "\\|", "\\','"))
Split the column on | and then join the resulting array back into a string:
import pyspark.sql.functions as F
import pyspark.sql.types as T
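def str_list(x):
    # str() of the array gives "['a', 'b']"; dropping the brackets keeps the quoted items
    return str(x).replace("[", "").replace("]", "")
str_udf = F.udf(str_list, T.StringType())
df = df.withColumn("arr_split", F.split(F.col("QUALIFY"), r"\|"))  # split takes a regex, so the pipe is escaped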
df = df.withColumn("QUALIFY2", str_udf(F.col("arr_split")))
My sample output DataFrame:
df.drop("arr_split").show() # Please ignore a and b columns
+---+---+--------------+--------------------+
| a| b| abc| QUALIFY2|
+---+---+--------------+--------------------+
| 1| 1|col1|col2|col3|'col1', 'col2', '...|
| 2| 2|col1|col2|col3|'col1', 'col2', '...|
| 3| 3|col1|col2|col3|'col1', 'col2', '...|
| 4| 4|col1|col2|col3|'col1', 'col2', '...|
| 5| 5|col1|col2|col3|'col1', 'col2', '...|
+---+---+--------------+--------------------+
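The UDF above relies on how Python renders a list as a string. A minimal plain-Python sketch (no Spark needed; the sample list is made up for illustration) shows the text it produces, including the space after each comma that is visible in the output above:

```python
def str_list(x):
    # str() of a list yields "['a', 'b']"; removing the brackets leaves the quoted items.
    return str(x).replace("[", "").replace("]", "")

sample = ["col1", "col2", "col3"]  # hypothetical split result for one row
result = str_list(sample)
print(result)  # 'col1', 'col2', 'col3'
```

Note that this approach would also strip any [ or ] that occurs inside a value, and Python renders a value containing a single quote with double quotes instead, so it is best suited to simple identifiers.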
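Why not first split it on | and then join the resulting array back into a string?
Your solution is almost there; you only need to add a quote at the beginning and at the end. The first snippet below does this by wrapping your regexp_replace call in concat with lit("'"). Alternatively, you can avoid the regular-expression replacement and get the same result with split and concat_ws, as in the second snippet: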
from pyspark.sql.functions import col, concat, lit, regexp_replace
df.withColumn(
    "QUALIFY2",
    concat(lit("'"), regexp_replace(col('QUALIFY'), r"\|", r"','"), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
from pyspark.sql.functions import concat, concat_ws, lit, split
df.withColumn(
    "QUALIFY2",
    # split takes a regex, so the pipe is escaped in a raw string
    concat(lit("'"), concat_ws("','", split("QUALIFY", r"\|")), lit("'"))
).show()
#+--------------+--------------------+
#| QUALIFY| QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#| ColA| 'ColA'|
#| ColZ|ColP| 'ColZ','ColP'|
#+--------------+--------------------+
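As a quick way to sanity-check the expected strings without a Spark session, the two approaches can be mirrored in plain Python. This is only an illustrative sketch: quote_by_replace and quote_by_split are hypothetical helper names, not PySpark functions, and the sample values are taken from the question.

```python
def quote_by_replace(value):
    # Mirrors concat(lit("'"), regexp_replace(...), lit("'")): swap each | for ',' and wrap in quotes.
    return "'" + value.replace("|", "','") + "'"

def quote_by_split(value):
    # Mirrors concat(lit("'"), concat_ws("','", split(...)), lit("'")): split on |, join with ','.
    return "'" + "','".join(value.split("|")) + "'"

samples = ["ColA|ColB|ColC", "ColA", "ColZ|ColP"]
for s in samples:
    print(s, "->", quote_by_replace(s), "/", quote_by_split(s))
```

Both helpers return the same string for each sample, which matches the QUALIFY2 column shown in the outputs above.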