PySpark: DataFrame with nested fields to a relational table
I have a PySpark DataFrame of students with the following schema:
 |-- Id: string
 |-- School: array
 |    |-- element: struct
 |    |    |-- Subject: string
 |    |    |-- Classes: string
 |    |    |-- Score: array
 |    |    |    |-- element: struct
 |    |    |    |    |-- ScoreID: string
 |    |    |    |    |-- Value: string
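I want to extract some fields from the DataFrame and normalize them so that they can be loaded into a database. The relational schema I expect consists of the fields Id, School, Subject, ScoreId, Value. How can I do this efficiently?

Explode the arrays to get flattened data, then select all the required columns.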
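To make the target shape concrete before involving Spark, here is a minimal plain-Python sketch of what "exploding" the two nested arrays means: one output row per (School entry, Score entry) pair. The `student` record and the `flatten` helper are hypothetical names used only for illustration; they are not part of PySpark.

```python
# Hypothetical in-memory record mirroring the schema in the question.
student = {
    "Id": "1",
    "School": [
        {
            "Subject": "a",
            "Classes": "b",
            "Score": [
                {"ScoreID": "A", "Value": "3"},
                {"ScoreID": "B", "Value": "4"},
            ],
        }
    ],
}


def flatten(record):
    """Yield one flat tuple per (School entry, Score entry) pair."""
    # `or []` treats a missing (None) array like an empty one, so such
    # records produce no rows at all.
    for school in record["School"] or []:
        for score in school["Score"] or []:
            yield (
                record["Id"],
                school["Classes"],
                school["Subject"],
                score["ScoreID"],
                score["Value"],
            )


rows = list(flatten(student))
print(rows)
```

Spark's `explode` behaves the same way for null or empty arrays: the row is dropped rather than kept with nulls.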
Example:
df.show(10, False)
#+---+--------------------------+
#|Id |School                    |
#+---+--------------------------+
#|1  |[[b, [[A, 3], [B, 4]], a]]|
#+---+--------------------------+
df.printSchema()
#root
# |-- Id: string (nullable = true)
# |-- School: array (nullable = true)
# |    |-- element: struct (containsNull = true)
# |    |    |-- Classes: string (nullable = true)
# |    |    |-- Score: array (nullable = true)
# |    |    |    |-- element: struct (containsNull = true)
# |    |    |    |    |-- ScoreID: string (nullable = true)
# |    |    |    |    |-- Value: string (nullable = true)
# |    |    |-- Subject: string (nullable = true)
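# Step 1: explode School, giving one row per school struct (default column name "col").
# Step 2: expand that struct into columns and explode its nested Score array.
# Step 3: keep the scalar columns and expand each score struct into ScoreID and Value.
df.selectExpr("Id", "explode(School)").\
    selectExpr("Id", "col.*", "explode(col.Score)").\
    selectExpr("Id", "Classes", "Subject", "col.*").\
    show()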
#+---+-------+-------+-------+-----+
#| Id|Classes|Subject|ScoreID|Value|
#+---+-------+-------+-------+-----+
#|  1|      b|      a|      A|    3|
#|  1|      b|      a|      B|    4|
#+---+-------+-------+-------+-----+