Split a JSON array into two rows in Spark Scala

Tags: arrays, json, scala, apache-spark

I have a dataframe like this:

root
 |-- runKeyId: string (nullable = true)
 |-- entities: string (nullable = true)
I want to explode it in Scala like this:

+--------+------------------------------------------------------+
|runKeyId|entities                                              |
+--------+------------------------------------------------------+
|1       |{"Partition":[{"Name":"ABC"},{"Name":"DBC"}],"id":339}|
|2       |{"Partition":{"Name":"DDD"},"id":339}                 |
+--------+------------------------------------------------------+

It looks like you don't have valid JSON, so fix the JSON first; then you can read it as JSON and explode it as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("1", "{\"Partition\":[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}],\"id\":339},{\"Partition\":{\"Name\":\"DDD\"},\"id\":339}")
).toDF("runKeyId", "entities")
  .withColumn("entities", concat(lit("["), $"entities", lit("]"))) // wrap in [ ] to fix the JSON

// Infer the element schema from the first row, parse the array with from_json,
// and explode it into one row per element
val resultDF = df.withColumn("entities",
  explode(from_json($"entities", schema_of_json(df.select($"entities").first().getString(0))))
).withColumn("entities", to_json($"entities")) // turn each struct back into a JSON string

resultDF.show(false)
Output:

+--------+----------------------------------------------------------------+
|runKeyId|entities                                                        |
+--------+----------------------------------------------------------------+
|1       |{"Partition":"[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}]","id":339}|
|1       |{"Partition":"{\"Name\":\"DDD\"}","id":339}                     |
+--------+----------------------------------------------------------------+
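
The schema_of_json call above runs an extra job just to fetch the first row, and the inferred schema only reflects that sampled value. Below is a minimal alternative sketch with a hand-written schema (entitiesSchema and explicitDF are my own names, not from the question): Partition is declared as StringType because it is an array in one element and an object in the other, and Spark's JSON parser then keeps the raw JSON text for that field, matching the output shown above.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Hypothetical explicit schema for the [ ]-wrapped entities column.
// Partition stays StringType: with a string target type the parser copies
// the raw object/array text instead of failing on the mixed shapes.
val entitiesSchema = ArrayType(
  StructType(Seq(
    StructField("Partition", StringType),
    StructField("id", LongType)
  ))
)

val explicitDF = df
  .withColumn("entities", explode(from_json($"entities", entitiesSchema)))
  .withColumn("entities", to_json($"entities")) // drop this line to keep entities as a struct column

explicitDF.show(false)

Dropping the final to_json is also the closest answer to the "result as JSON instead of a string" question in the comments below: Spark has no json column type, only structs, arrays, and strings.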

Comments:

How are you reading the file? It looks like JSONL format; in that case you can simply call
spark.read.json("json_path")
and it automatically splits the JSON into rows.

I am getting it as a string, not as JSON.

How do you read the input JSON data? val parseDF = decompressDataDF.select($"_1.entities")

I have provided an answer to a similar question here, please take a look.

Is it possible to get the result as JSON instead of a string, since it affects further logic: root |-- Id: string (nullable = true) |-- entities: json (nullable = true)

What do you mean by the result as JSON? There is no such json type. Can you share what your output should look like, or the output schema?

It is adding an extra \" in [{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}], making it a string.

@shreypavagadhi That is because the JSON inside Partition is also invalid: the first Partition is an array type while the second is an object.
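
As the first comment suggests, if the source is a JSON Lines file (one complete JSON document per line) there is nothing to fix by hand: spark.read.json already produces one row per document. A minimal sketch, assuming a file at the json_path mentioned in the comment whose lines look like the two entities values from the question:

// Assumed file contents, one JSON document per line:
//   {"Partition":[{"Name":"ABC"},{"Name":"DBC"}],"id":339}
//   {"Partition":{"Name":"DDD"},"id":339}
val jsonlDF = spark.read.json("json_path")

jsonlDF.printSchema() // with conflicting array/object values, Partition will typically be inferred as string
jsonlDF.show(false)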