
Python: Converting a PySpark dataframe column from a list to a string


I have this PySpark dataframe:

+-----------+--------------------+
|uuid       |   test_123         |
+-----------+--------------------+
|      1    |[test, test2, test3]|
|      2    |[test4, test, test6]|
|      3    |[test6, test9, t55o]|
+-----------+--------------------+
and I want to convert the column test_123 so that it looks like this:

+-----------+--------------------+
|uuid       |   test_123         |
+-----------+--------------------+
|      1    |"test,test2,test3"  |
|      2    |"test4,test,test6"  |
|      3    |"test6,test9,t55o"  |
+-----------+--------------------+
So, from a list of strings to a single string.


How can I do it with PySpark?

You can create a udf that joins the array/list and then apply it to the test column:
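A minimal sketch of such a udf, assuming the column holds an array of strings with no null entries:

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# plain Python ",".join wrapped in a udf; returns one comma-separated string per row
join_udf = udf(lambda x: ",".join(x), StringType())

df.withColumn("test_123", join_udf(col("test_123"))).show()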

The initial dataframe is created like this:

from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType

# uuid is an integer, test_123 an array of strings
schema = StructType([StructField("uuid", IntegerType(), True),
                     StructField("test_123", ArrayType(StringType(), True), True)])
rdd = sc.parallelize([[1, ["test", "test2", "test3"]], [2, ["test4", "test", "test6"]],
                      [3, ["test6", "test9", "t55o"]]])
df = spark.createDataFrame(rdd, schema)

df.show()
+----+--------------------+
|uuid|            test_123|
+----+--------------------+
|   1|[test, test2, test3]|
|   2|[test4, test, test6]|
|   3|[test6, test9, t55o]|
+----+--------------------+

While you can use a UserDefinedFunction, it is very inefficient, since every row has to be serialized to a Python worker and back. Instead, it is better to use the built-in concat_ws function:

from pyspark.sql.functions import concat_ws

df.withColumn("test_123", concat_ws(",", "test_123")).show()
+----+----------------+
|uuid|        test_123|
+----+----------------+
|   1|test,test2,test3|
|   2|test4,test,test6|
|   3|test6,test9,t55o|
+----+----------------+

Since version 2.4.0, you can use array_join:

from pyspark.sql.functions import array_join
df.withColumn("test_123", array_join("test_123", ",")).show()
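For the sample dataframe above, this yields the same result as concat_ws:

+----+----------------+
|uuid|        test_123|
+----+----------------+
|   1|test,test2,test3|
|   2|test4,test,test6|
|   3|test6,test9,t55o|
+----+----------------+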

This should be the accepted answer, imho.