Python: Convert a PySpark DataFrame column from a list to a string
I have this PySpark DataFrame:
+-----------+--------------------+
|uuid | test_123 |
+-----------+--------------------+
| 1 |[test, test2, test3]|
| 2 |[test4, test, test6]|
| 3 |[test6, test9, t55o]|
I want to convert the column test_123 to the following:
+-----------+--------------------+
|uuid | test_123 |
+-----------+--------------------+
| 1 |"test,test2,test3" |
| 2 |"test4,test,test6" |
| 3 |"test6,test9,t55o" |
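So, from a list to a string.
How can I do this with PySpark?

You can create a udf that joins the array/list and then apply it to the test_123 column.
The initial DataFrame is created with: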
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, StringType
schema = StructType([StructField("uuid",IntegerType(),True),StructField("test_123",ArrayType(StringType(),True),True)])
rdd = sc.parallelize([[1, ["test","test2","test3"]], [2, ["test4","test","test6"]],[3,["test6","test9","t55o"]]])
df = spark.createDataFrame(rdd, schema)
df.show()
+----+--------------------+
|uuid| test_123|
+----+--------------------+
| 1|[test, test2, test3]|
| 2|[test4, test, test6]|
| 3|[test6, test9, t55o]|
+----+--------------------+
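The udf itself is not shown in the answer above. Here is a minimal sketch of what it might look like, assuming the df built above. The names join_list and add_joined_column are hypothetical, and skipping None elements inside the list is my own choice rather than something the answer specifies.

```python
def join_list(values, sep=","):
    """Join a list of strings into one string; plain Python, no Spark needed."""
    if values is None:
        # A null array stays null instead of raising inside the udf.
        return None
    # Skip None elements so that str.join does not raise a TypeError.
    return sep.join(v for v in values if v is not None)


def add_joined_column(df, column="test_123"):
    """Return df with `column` (array<string>) replaced by its joined string."""
    # Imported here so join_list can be tried without a Spark installation.
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    join_udf = udf(join_list, StringType())
    return df.withColumn(column, join_udf(col(column)))
```

With the DataFrame above, add_joined_column(df).show() should print test_123 as comma-separated strings. Each row is passed through the Python function, which is the overhead that the built-in functions avoid.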
While you can use a UserDefinedFunction, it is very inefficient. Instead, it is better to use the concat_ws function:
from pyspark.sql.functions import concat_ws
df.withColumn("test_123", concat_ws(",", "test_123")).show()
+----+----------------+
|uuid|        test_123|
+----+----------------+
|   1|test,test2,test3|
|   2|test4,test,test6|
|   3|test6,test9,t55o|
+----+----------------+
As of version 2.4.0, you can use array_join:
from pyspark.sql.functions import array_join
df.withColumn("test_123", array_join("test_123", ",")).show()
IMHO this should be the accepted answer.