Unpivoting a PySpark DataFrame column whose values are lists of dictionaries (JSON)


json pandas apache-spark pyspark


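I created a pandas DataFrame from a list of dictionaries and used json_normalize to unpivot one of its columns. Now I have to convert the code to use PySpark instead of pandas.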

df = pd.json_normalize(list_json,'Messages',['ID'])

ID,    Active, Description,       Priority
21122, true,   Test description1, 2
21233, true,   Test description1, 2
21233, true,   test2,             3
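
To make the target shape concrete, here is a minimal plain-Python sketch of the flattening that json_normalize(list_json, 'Messages', ['ID']) performs for this data: one output row per nested message, with the parent ID copied onto each row. The contents of list_json are an assumption reconstructed from the rows shown in the question, and flatten_messages is a hypothetical helper, not a pandas or Spark function.

```python
# Hypothetical input, reconstructed from the rows shown in the question.
list_json = [
    {
        "ID": 21122,
        "Messages": [
            {"Active": "true", "Description": "Test description1", "Priority": "2"},
        ],
    },
    {
        "ID": 21233,
        "Messages": [
            {"Active": "true", "Description": "Test description1", "Priority": "2"},
            {"Active": "true", "Description": "test2", "Priority": "3"},
        ],
    },
]


def flatten_messages(records, record_path="Messages", meta=("ID",)):
    """Return one flat dict per nested message, carrying the meta keys along."""
    rows = []
    for record in records:
        for message in record.get(record_path, []):
            # Start from the parent-level keys, then add the message fields.
            row = {key: record[key] for key in meta}
            row.update(message)
            rows.append(row)
    return rows


rows = flatten_messages(list_json)
for row in rows:
    print(row)
```

Any PySpark solution needs to reproduce this one-row-per-message expansion.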

I can't find an equivalent function in PySpark.

I created a DataFrame with the code below, but I don't know how to unpivot it the way I did above.

df = spark.sparkContext.parallelize(list_json_messages_tea).map(lambda x: json.dumps(x))
df = spark.read.json(df)

ID, Messages
21122, [{"Active": "true", "Description": "Test description1", "Priority": "2"}]
21233, [{"Active": "true", "Description": "Test description1", "Priority": "2"}, {"Active": "true", "Description": "test2",  "Priority": "3"}]

I think the equivalent approach is to use inline(from_json()):

df2 = df.selectExpr('ID', "inline(from_json(Messages, 'array<struct<Active:string,Description:string,Priority:string>>'))")

df2.show()
+-----+------+-----------------+--------+
|   ID|Active|      Description|Priority|
+-----+------+-----------------+--------+
|21122|  true|Test description1|       2|
|21233|  true|Test description1|       2|
|21233|  true|            test2|       3|
+-----+------+-----------------+--------+