Converting an array with a nested structure to a string column, alongside other columns, in a PySpark DataFrame


This is similar to an existing question, but the accepted answer there does not work for my case, so I am asking here. My schema is:

|-- Col1: string (nullable = true)
|-- Col2: array (nullable = true)
    |-- element: struct (containsNull = true)
          |-- Col2Sub: string (nullable = true)
Sample JSON:

{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}
The following gives the result in a single column:

import pyspark.sql.functions as F

df.selectExpr("EXPLODE(Col2) AS structCol") \
  .select(F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")) \
  .show()
    +----------------+
    | Col2_concated  |
    +----------------+
    |foo,bar         |
    +----------------+
But how do I get a result, i.e. a DataFrame, like this?

+-------+---------------+
|Col1   | Col2_concated |
+-------+---------------+
|abc123 |foo,bar        |
+-------+---------------+
EDIT: This solution gives the wrong result:

df.selectExpr("Col1", "EXPLODE(Col2) AS structCol") \
  .select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")) \
  .show()


+-------+---------------+
|Col1   | Col2_concated |
+-------+---------------+
|abc123 |foo            |
|abc123 |bar            |
+-------+---------------+

Just avoid the explode and you are almost there. All you need is the concat_ws function, which concatenates multiple string columns with a given separator. See the example below:

from pyspark.sql import functions as F
j = '{"Col1":"abc123","Col2":[{"Col2Sub":"foo"},{"Col2Sub":"bar"}]}'
df = spark.read.json(sc.parallelize([j]))

#printSchema tells us the column names we can use with concat_ws                                                                              
df.printSchema()
Output:

root
 |-- Col1: string (nullable = true)
 |-- Col2: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- Col2Sub: string (nullable = true)
The Col2 column is an array of Col2Sub, and we can use this nested field name to get the desired result:

bla = df.withColumn('Col2', F.concat_ws(',', df.Col2.Col2Sub))

bla.show()
+------+-------+                                                                
|  Col1|   Col2|
+------+-------+
|abc123|foo,bar|
+------+-------+

Comments:

You can select "Col1" even though it is not an expression: df.selectExpr("Col1", "EXPLODE(Col2) AS structCol").select("Col1", F.expr("concat_ws(',', structCol.*)").alias("Col2_concated")).show()

Sorry, that does not work; I added the reason to the question.

Thanks, you are a lifesaver :)