PySpark: DataFrame with nested fields to a relational table


I have a PySpark DataFrame of students with the following schema:

 |-- Id: string
 |-- School: array
 |    |-- element: struct
 |    |    |-- Subject: string
 |    |    |-- Classes: string
 |    |    |-- Score: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- ScoreID: string
 |    |    |    |    |-- Value: string

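I want to extract some fields from the DataFrame and normalize them so they can be loaded into a database. The relational schema I expect consists of the fields Id, School, Subject, ScoreId, Value. How can I do this efficiently?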

Explode the arrays to flatten the data (one explode per nesting level), then select all the required columns.

Example:

df.show(10,False)
#+---+--------------------------+
#|Id |School                    |
#+---+--------------------------+
#|1  |[[b, [[A, 3], [B, 4]], a]]|
#+---+--------------------------+

df.printSchema()
#root
# |-- Id: string (nullable = true)
# |-- School: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- Classes: string (nullable = true)
# |    |    |-- Score: array (nullable = true)
# |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |-- ScoreID: string (nullable = true)
# |    |    |    |    |-- Value: string (nullable = true)
# |    |    |-- Subject: string (nullable = true)

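(
    df
    # Explode School: one row per array element; the new struct column is named "col".
    .selectExpr("Id", "explode(School)")
    # Expand the struct's fields and explode its nested Score array (again named "col").
    .selectExpr("Id", "col.*", "explode(col.Score)")
    # Keep the scalar columns and expand each score struct into ScoreID and Value.
    .selectExpr("Id", "Classes", "Subject", "col.*")
    .show()
)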
#+---+-------+-------+-------+-----+
#| Id|Classes|Subject|ScoreID|Value|
#+---+-------+-------+-------+-----+
#|  1|      b|      a|      A|    3|
#|  1|      b|      a|      B|    4|
#+---+-------+-------+-------+-----+
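
To make clear what the two explode calls above do to the data, here is a minimal plain-Python sketch of the same flattening. It is an illustration only, not Spark code: the `students` list and the `flatten` helper are hypothetical names, and the record simply mirrors the single sample row shown above. Each nested loop plays the role of one explode, so the result has one row per (student, School element, Score element).

```python
# Plain-Python illustration only; this is not Spark code.
# The record mirrors the single sample row from the example above.
students = [
    {
        "Id": "1",
        "School": [
            {
                "Classes": "b",
                "Score": [
                    {"ScoreID": "A", "Value": "3"},
                    {"ScoreID": "B", "Value": "4"},
                ],
                "Subject": "a",
            }
        ],
    }
]


def flatten(records):
    """Return one flat tuple per (student, School element, Score element)."""
    rows = []
    for student in records:
        for school in student["School"]:   # corresponds to explode(School)
            for score in school["Score"]:  # corresponds to explode(col.Score)
                rows.append(
                    (
                        student["Id"],
                        school["Classes"],
                        school["Subject"],
                        score["ScoreID"],
                        score["Value"],
                    )
                )
    return rows


rows = flatten(students)
for row in rows:
    print(row)
```

As with the nested loops here, Spark's explode produces no output row for a record whose array is empty or null. If such students must still appear in the relational table, explode_outer keeps them and fills the exploded columns with null.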