Adding single quotes to DataFrame column values
Tags: dataframe, apache-spark, pyspark, databricks


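A DataFrame contains a column QUALIFY whose values look like this:

QUALIFY
=================
ColA|ColB|ColC
ColA
ColZ|ColP

The values in this column are delimited by "|". I want the values in this column to look like 'ColA','ColB','ColC'...

With the code below I can replace each | with ','. How can I also add a single quote at the start and end of the value?

newDf = df_qualify.withColumn('QUALIFY2', regexp_replace('QUALIFY', "\\|", "\\','"))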

Split the column on |, then join the resulting array back into a string:

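import pyspark.sql.functions as F
import pyspark.sql.types as T

def str_list(x):
    # x arrives as a Python list; str() renders it as "['a', 'b']",
    # so stripping the brackets leaves "'a', 'b'"
    return str(x).replace("[", "").replace("]", "")

str_udf = F.udf(str_list, T.StringType())

# split() takes a regex, so the pipe is escaped; the raw string keeps the backslash intact
df = df.withColumn("arr_split", F.split(F.col("QUALIFY"), r"\|"))
df = df.withColumn("QUALIFY2", str_udf(F.col("arr_split")))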
My sample output DataFrame:

df.drop("arr_split").show() # Please ignore a and b columns
+---+---+--------------+--------------------+
|  a|  b|           abc|            QUALIFY2|
+---+---+--------------+--------------------+
|  1|  1|col1|col2|col3|'col1', 'col2', '...|
|  2|  2|col1|col2|col3|'col1', 'col2', '...|
|  3|  3|col1|col2|col3|'col1', 'col2', '...|
|  4|  4|col1|col2|col3|'col1', 'col2', '...|
|  5|  5|col1|col2|col3|'col1', 'col2', '...|
+---+---+--------------+--------------------+
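
The string logic inside the UDF can be tried in plain Python, without a Spark session. This is only a sketch of what happens to one row's array, assuming the UDF receives the split result as an ordinary Python list; the sample values are made up for illustration. It also shows a caveat: the quoting comes from Python's list representation, so an element that contains an apostrophe is rendered with double quotes, and any square brackets inside the values would be stripped as well.

```python
def str_list(x):
    # Same body as the UDF: render the list with str() and drop the brackets.
    return str(x).replace("[", "").replace("]", "")

# One row's value after splitting on the pipe.
parts = "col1|col2|col3".split("|")
print(str_list(parts))           # 'col1', 'col2', 'col3'

# Caveat: an apostrophe inside an element changes the quote style of that element.
print(str_list(["it's", "b"]))   # "it's", 'b'
```

Note that the separator here is a comma followed by a space, which differs slightly from the 'ColA','ColB','ColC' format asked for in the question.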

Your solution is almost there; you only need to add a quote at the start and at the end. You can do this with concat and lit:

from pyspark.sql.functions import col, concat, lit, regexp_replace

df.withColumn(
    "QUALIFY2",
    concat(lit("'"), regexp_replace(col('QUALIFY'), r"\|", r"','"), lit("'"))
).show()
#+--------------+--------------------+
#|       QUALIFY|            QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#|          ColA|              'ColA'|
#|     ColZ|ColP|       'ColZ','ColP'|
#+--------------+--------------------+

Alternatively, you can avoid regexp_replace and get the same result with split and concat_ws:

from pyspark.sql.functions import concat, concat_ws, lit, split

df.withColumn(
    "QUALIFY2",
    concat(lit("'"), concat_ws("','", split("QUALIFY", r"\|")), lit("'"))
).show()
#+--------------+--------------------+
#|       QUALIFY|            QUALIFY2|
#+--------------+--------------------+
#|ColA|ColB|ColC|'ColA','ColB','ColC'|
#|          ColA|              'ColA'|
#|     ColZ|ColP|       'ColZ','ColP'|
#+--------------+--------------------+

Why not split it on | first, then join the resulting array back into a string?
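
As a quick sanity check of the split-and-join idea, the expected strings for the question's three sample rows can be reproduced in plain Python. This is a minimal sketch, not Spark code: quote_values is a hypothetical helper name introduced here, and it only mirrors the per-value string logic (split on the pipe, join with ',' and wrap the result in single quotes).

```python
def quote_values(value, sep="|"):
    """Hypothetical helper: wrap each sep-delimited item in single quotes."""
    # Joining with "','" puts quotes between items; the outer quotes close both ends.
    return "'" + "','".join(value.split(sep)) + "'"

# The three sample rows from the question.
for row in ["ColA|ColB|ColC", "ColA", "ColZ|ColP"]:
    print(quote_values(row))
```

A helper like this can be handy for building expected values when unit testing the DataFrame transformation.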