How to dynamically parse a JSON column in a DataFrame without knowing its schema in Spark Scala
Given a Spark DataFrame with a column that may or may not contain nested JSON. The nested JSON is dynamic. The end requirement is to break the JSON apart and produce a new DataFrame with a new column for every key in the nested JSON; since the JSON is dynamic, the resulting table is dynamic as well. Also note that the data file consists of 100 million+ records.

E.g. input:
------------------------------------------------------------------------
|id  |key  |type |value
------------------------------------------------------------------------
|f9f |BUSI |off  |false
|f96 |NAME |50   |true
|f9z |BANK |off  |{"Name":"United School","admNumber":"197108","details":{"code":"WEREFFW32","studentName":"Abhishek kumar","doc":"certificate","admId":"3424325328","stat":0,"studentDetails":false}}
------------------------------------------------------------------------
Output:
---------------------------------------------------------------------------------------------------------------------------
|id  |key  |type |value |Name          |admNumber |code      |studentName    |doc         |admId      |stat |studentDetails|
---------------------------------------------------------------------------------------------------------------------------
|f9f |BUSI |off  |false |NULL          |NULL      |NULL      |NULL           |NULL        |NULL       |NULL |NULL          |
|f96 |NAME |50   |true  |NULL          |NULL      |NULL      |NULL           |NULL        |NULL       |NULL |NULL          |
|f9z |BANK |off  |NULL  |United School |197108    |WEREFFW32 |Abhishek kumar |certificate |3424325328 |0    |false         |
---------------------------------------------------------------------------------------------------------------------------
Comments:
- Does the initial DataFrame have only the one JSON column? Do you want all nested keys to become columns (i.e. flattened)? Please give an input and output example. Not knowing the schema will cost a lot of performance; if you can somehow learn the schema ahead of time, performance becomes much easier to improve.
- @ziad.rida Yes, the initial DataFrame has a single JSON column, and all nested keys need to become columns. The nested JSON will not go deeper than 3 or 4 levels.
- Does this also work for nested JSON?
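One answer: let Spark infer the schema itself by reading the JSON column as a Dataset[String] with spark.read.json, then parse the column with from_json using that inferred schema, and expand the resulting struct via select("s.*"):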
// In spark-shell these imports are already in scope
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val data = Seq(
(77, "email1", """{"key1":38,"key3":39}"""),
(78, "email2", """{"key1":38,"key4":39}"""),
(178, "email21", """{"key1":"when string","key4":36, "key6":"test", "key10":false }"""),
(179, "email8", """{"sub1":"qwerty","sub2":["42"]}"""),
(180, "email8", """{"sub1":"qwerty","sub2":["42", "56", "test"]}""")
).toDF("id", "name", "colJson")
data.show(false)
// +---+-------+----------------------------------------------------------------+
// |id |name   |colJson                                                         |
// +---+-------+----------------------------------------------------------------+
// |77 |email1 |{"key1":38,"key3":39}                                           |
// |78 |email2 |{"key1":38,"key4":39}                                           |
// |178|email21|{"key1":"when string","key4":36, "key6":"test", "key10":false }|
// |179|email8 |{"sub1":"qwerty","sub2":["42"]}                                 |
// |180|email8 |{"sub1":"qwerty","sub2":["42", "56", "test"]}                   |
// +---+-------+----------------------------------------------------------------+
// Infer the schema across all rows by reading the JSON column as a Dataset[String]
val schema = spark.read.json(data.select("colJson").as[String]).schema

// Parse each row's JSON with the inferred schema, then promote the struct's fields to columns
val res = data.select($"id", $"name", from_json($"colJson", schema).as("s")).select("id", "name", "s.*")
res.show(false)
// +---+-------+-----------+-----+----+----+----+------+--------------+
// |id |name   |key1       |key10|key3|key4|key6|sub1  |sub2          |
// +---+-------+-----------+-----+----+----+----+------+--------------+
// |77 |email1 |38         |null |39  |null|null|null  |null          |
// |78 |email2 |38         |null |null|39  |null|null  |null          |
// |178|email21|when string|false|null|36  |test|null  |null          |
// |179|email8 |null       |null |null|null|null|qwerty|[42]          |
// |180|email8 |null       |null |null|null|null|qwerty|[42, 56, test]|
// +---+-------+-----------+-----+----+----+----+------+--------------+
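The question mentions 100M+ records, and spark.read.json above scans the whole column once just to infer a schema. If a representative sample is acceptable (an assumption: keys that only appear in rare rows could be missed), the inference pass can run on a fraction of the data:

// Sketch: infer the schema from ~0.1% of rows instead of the full column.
// Adjust the fraction to how uniform the JSON actually is.
val sampledSchema = spark.read
  .json(data.select("colJson").as[String].sample(withReplacement = false, fraction = 0.001))
  .schema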
// The flattened columns behave like any other column, e.g. for filtering:
val df1 = res.filter('sub1.equalTo("qwerty"))
df1.show(false)
// +---+------+----+-----+----+----+----+------+--------------+
// |id |name  |key1|key10|key3|key4|key6|sub1  |sub2          |
// +---+------+----+-----+----+----+----+------+--------------+
// |179|email8|null|null |null|null|null|qwerty|[42]          |
// |180|email8|null|null |null|null|null|qwerty|[42, 56, test]|
// +---+------+----+-----+----+----+----+------+--------------+
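A note on depth: select("s.*") only expands the top level of the struct, while the comments allow for 3 or 4 levels of nesting (e.g. the "details" object in the question). A minimal sketch of recursive flattening, assuming nested objects are inferred as StructType fields; flattenStruct is a hypothetical helper, not part of the Spark API:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{StructField, StructType}

// Hypothetical helper: walk the inferred schema and emit one column per leaf
// field, renaming dotted paths to underscore-joined names
// (s.details.code -> details_code).
def flattenStruct(schema: StructType, path: String): Seq[Column] =
  schema.fields.toSeq.flatMap {
    case StructField(name, inner: StructType, _, _) =>
      flattenStruct(inner, s"$path.$name") // recurse into nested objects
    case StructField(name, _, _, _) =>
      val full = s"$path.$name"
      Seq(col(full).as(full.split('.').drop(1).mkString("_")))
  }

val parsed = data.select($"id", $"name", from_json($"colJson", schema).as("s"))
val flat   = parsed.select((Seq($"id", $"name") ++ flattenStruct(schema, "s")): _*)
flat.show(false)

Array fields (such as sub2 above) stay arrays under this scheme; turning them into rows would need explode, which multiplies the row count and is a separate decision at 100M+ records.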