Python PySpark - JSON: exploding a nested struct and an array of structs
I am trying to parse nested JSON from a sample file. Below is the printed schema:
|-- batters: struct (nullable = true)
| |-- batter: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- id: string (nullable = true)
| | | |-- type: string (nullable = true)
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- ppu: double (nullable = true)
|-- topping: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- type: string (nullable = true)
|-- type: string (nullable = true)
I tried exploding batter and topping separately and then merging the results:
df_batter = df_json.select("batters.*")
df_explode1 = df_batter.withColumn("batter", explode("batter")).select("batter.*")
df_explode2 = df_json.withColumn("topping", explode("topping")).select(
    "id", "type", "name", "ppu", "topping.*")
I could not merge the two dataframes.
I also tried it as a single query:
exploded1 = df_json.withColumn("batter", df_batter.withColumn("batter",
explode("batter"))).withColumn("topping", explode("topping")).select("id",
"type","name","ppu","topping.*","batter.*")
but it throws an error. Please help me fix it. Thanks.

You basically have to zip the arrays together with arrays_zip, which returns a merged array of structs, and then explode them together. Try this. I haven't tested it, but it should work:
from pyspark.sql import functions as F
df_json.select("id","type","name","ppu","topping","batters.*")\
.withColumn("zipped", F.explode(F.arrays_zip("batter","topping")))\
.select("id","type","name","ppu","zipped.*").show()
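As a plain-Python sanity check of what arrays_zip does (an analogy, not Spark code): it pairs elements positionally and pads the shorter array with null, so exploding the zipped array yields max(len(batter), len(topping)) rows rather than a cross product.

```python
from itertools import zip_longest

batter = [{"id": "1001", "type": "Regular"}, {"id": "1002", "type": "Chocolate"}]
topping = [{"id": "5001", "type": "None"}]

# arrays_zip analogue: positional pairing, shorter side padded with None (null)
zipped = list(zip_longest(batter, topping))

# "exploding" the zipped array gives one row per pair
rows = [{"batter": b, "topping": t} for b, t in zipped]
assert len(rows) == max(len(batter), len(topping))  # 2 rows
assert rows[1]["topping"] is None                   # null-padded
```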
You can also do it one explode at a time:
from pyspark.sql import functions as F
df1 = df_json.select("id","type","name","ppu","topping","batters.*")\
    .withColumn("batter", F.explode("batter"))\
    .select("id","type","name","ppu","topping","batter")
df1.withColumn("topping", F.explode("topping"))\
    .select("id","type","name","ppu","topping.*","batter.*")
You can't explode two arrays like that in a single step. You need to zip them with arrays_zip and then explode them together.
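To see why chaining one explode after another is usually not what you want, here is a small plain-Python illustration (again an analogy for two chained explodes, not Spark code): each explode multiplies the row count, so the result is a cross join of the two arrays instead of a positional pairing.

```python
batter = ["Regular", "Chocolate"]
topping = ["None", "Glazed", "Sugar"]

# two chained explodes behave like nested loops: one row per combination
rows = [(b, t) for b in batter for t in topping]
assert len(rows) == len(batter) * len(topping)  # 6 rows, not max(2, 3)
```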