Python: how to append values from exploded values in a DataFrame in PySpark

The data is:
data = [{"_id":"Inst001","Type":"AAAA", "Model001":[{"_id":"Mod001", "Name": "FFFF"},
{"_id":"Mod0011", "Name": "FFFF4"}]},
{"_id":"Inst002", "Type":"BBBB", "Model001":[{"_id":"Mod002", "Name": "DDD"}]}]
The DataFrame needs to be built like this:

+-------+-------+-----+
|pid    |_id    |Name |
+-------+-------+-----+
|Inst001|Mod001 |FFFF |
|Inst001|Mod0011|FFFF4|
|Inst002|Mod002 |DDD  |
+-------+-------+-----+
Create the DataFrame with an explicit schema, then use inline on the Model001 column:
df = spark.createDataFrame(
data,
    '_id string, Type string, Model001 array<struct<_id:string, Name:string>>'
).selectExpr('_id as pid', 'inline(Model001)')
df.show(truncate=False)
+-------+-------+-----+
|pid |_id |Name |
+-------+-------+-----+
|Inst001|Mod001 |FFFF |
|Inst001|Mod0011|FFFF4|
|Inst002|Mod002 |DDD |
+-------+-------+-----+
What have you tried so far? @nerdyGuy I have already exploded Model001, but I'm having trouble appending the parent _id to the exploded DataFrame.