Split a JSON array into two rows in Spark Scala

Tags: arrays, json, scala, apache-spark

I have a dataframe like this:

root
 |-- runKeyId: string (nullable = true)
 |-- entities: string (nullable = true)
I want to explode it in Scala like this:

+--------+------------------------------------------------------+
|runKeyId|entities                                              |
+--------+------------------------------------------------------+
|1       |{"Partition":[{"Name":"ABC"},{"Name":"DBC"}],"id":339}|
|2       |{"Partition":{"Name":"DDD"},"id":339}                 |
+--------+------------------------------------------------------+

It looks like you don't have valid JSON, so fix the JSON first; then you can read it as JSON and explode it as shown below:

import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("1", "{\"Partition\":[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}],\"id\":339},{\"Partition\":{\"Name\":\"DDD\"},\"id\":339}")
).toDF("runKeyId", "entities")
  .withColumn("entities", concat(lit("["), $"entities", lit("]"))) // wrap in [ ] to fix the JSON

// Infer the element schema from the first row, parse the array with from_json,
// and explode it into one row per element
val resultDF = df.withColumn("entities",
  explode(from_json($"entities", schema_of_json(df.select($"entities").first().getString(0))))
).withColumn("entities", to_json($"entities")) // turn each struct back into a JSON string

resultDF.show(false)
Output:

+--------+----------------------------------------------------------------+
|runKeyId|entities                                                        |
+--------+----------------------------------------------------------------+
|1       |{"Partition":"[{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}]","id":339}|
|1       |{"Partition":"{\"Name\":\"DDD\"}","id":339}                     |
+--------+----------------------------------------------------------------+
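
The schema_of_json call above runs an extra job just to fetch the first row, and the inferred schema only reflects that sampled value. Below is a minimal alternative sketch with a hand-written schema (entitiesSchema and explicitDF are my own names, not from the question): Partition is declared as StringType because it is an array in one element and an object in the other, and Spark's JSON parser then keeps the raw JSON text for that field, matching the output shown above.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Hypothetical explicit schema for the [ ]-wrapped entities column.
// Partition stays StringType: with a string target type the parser copies
// the raw object/array text instead of failing on the mixed shapes.
val entitiesSchema = ArrayType(
  StructType(Seq(
    StructField("Partition", StringType),
    StructField("id", LongType)
  ))
)

val explicitDF = df
  .withColumn("entities", explode(from_json($"entities", entitiesSchema)))
  .withColumn("entities", to_json($"entities")) // drop this line to keep entities as a struct column

explicitDF.show(false)

Dropping the final to_json is also the closest answer to the "result as JSON instead of a string" question in the comments below: Spark has no json column type, only structs, arrays, and strings.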

Comments:

How are you reading the file? It looks like JSONL format; in that case you can simply call
spark.read.json("json_path")
and it automatically splits the JSON into rows.

I am getting it as a string, not as JSON.

How do you read the input JSON data? val parseDF = decompressDataDF.select($"_1.entities")

I have provided an answer to a similar question here, please take a look.

Is it possible to get the result as JSON instead of a string, since it affects further logic: root |-- Id: string (nullable = true) |-- entities: json (nullable = true)

What do you mean by the result as JSON? There is no such json type. Can you share what your output should look like, or the output schema?

It is adding an extra \" in [{\"Name\":\"ABC\"},{\"Name\":\"DBC\"}], making it a string.

@shreypavagadhi That is because the JSON inside Partition is also invalid: the first Partition is an array type while the second is an object.
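
As the first comment suggests, if the source is a JSON Lines file (one complete JSON document per line) there is nothing to fix by hand: spark.read.json already produces one row per document. A minimal sketch, assuming a file at the json_path mentioned in the comment whose lines look like the two entities values from the question:

// Assumed file contents, one JSON document per line:
//   {"Partition":[{"Name":"ABC"},{"Name":"DBC"}],"id":339}
//   {"Partition":{"Name":"DDD"},"id":339}
val jsonlDF = spark.read.json("json_path")

jsonlDF.printSchema() // with conflicting array/object values, Partition will typically be inferred as string
jsonlDF.show(false)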