Pyspark 关系化json深层嵌套数组
我有下面的目录,想用AWS胶水把它弄平Pyspark 关系化json深层嵌套数组,pyspark,aws-glue,Pyspark,Aws Glue,我有下面的目录,想用AWS胶水把它弄平 | accountId | resourceId | items | |-----------|------------|-----------------------------------------------------------------| | 1 | r1 | {application:{com
| accountId | resourceId | items |
|-----------|------------|-----------------------------------------------------------------|
| 1 | r1 | {application:{component:[{name: "tool", version: "1.0"}, {name: "app", version: "1.0"}]}} |
| 1 | r2 | {application:{component:[{name: "tool", version: "2.0"}, {name: "app", version: "2.0"}]}} |
| 2 | r3 | {application:{component:[{name: "tool", version: "3.0"}, {name: "app", version: "3.0"}]}} |
这是我的模式
root
|-- accountId:
|-- resourceId:
|-- PeriodId:
|-- items:
| |-- application:
| | |-- component: array
我想将其展平为以下内容:
| accountId | resourceId | name | version |
|-----------|------------|------|---------|
| 1 | r1 | tool | 1.0 |
| 1 | r1 | app | 1.0 |
| 1 | r2 | tool | 2.0 |
| 1 | r2 | app | 2.0 |
| 2 | r3 | tool | 3.0 |
| 2 | r3 | app | 3.0 |
根据我从您的架构和数据中了解到的情况,您的架构是一个深度嵌套的结构,因此您可以对
items.application.component
进行分解,然后从中选择您的名称
和版本
列
此链接可能有助于您了解:
from pyspark.sql import functions as F
df.withColumn("items", F.explode(F.col("items.application.component")))\
.select("accountId","resourceId","items.name","items.version").show()
+---------+----------+----+-------+
|accountId|resourceId|name|version|
+---------+----------+----+-------+
| 1| r1|tool| 1.0|
| 1| r1| app| 1.0|
| 1| r2|tool| 2.0|
| 1| r2| app| 2.0|
| 2| r3|tool| 3.0|
| 2| r3| app| 3.0|
+---------+----------+----+-------+