Spark from_json - StructType and ArrayType

json scala apache-spark

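I have a data set that arrives as XML, and one of its nodes contains JSON. Spark reads that node in as a StringType, so I am trying to use from_json() to convert the JSON into a DataFrame.

I am able to convert a plain JSON string, but how do I write the schema so that it works with arrays?

String without an array - works fine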

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val schemaExample = new StructType()
          .add("FirstName", StringType)
          .add("Surname", StringType)

val dfExample = spark.sql("""select "{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }" as theJson""")

val dfICanWorkWith = dfExample.select(from_json($"theJson", schemaExample))

dfICanWorkWith.collect()

// Results \\
res19: Array[org.apache.spark.sql.Row] = Array([[Johnny,Boy]])

String with an array - can't figure this one out

import org.apache.spark.sql.functions._

val schemaExample2 = new StructType()
                              .add("", ArrayType(new StructType()
                                                          .add("FirstName", StringType)
                                                          .add("Surname", StringType)
                                                )
                                  )

val dfExample2= spark.sql("""select "[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }" as theJson""")

val dfICanWorkWith = dfExample2.select(from_json($"theJson", schemaExample2))

dfICanWorkWith.collect()

// Result \\
res22: Array[org.apache.spark.sql.Row] = Array([null])

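The problem is that your string is not well-formed JSON that matches your schema. It is missing a few things:

First, the surrounding {} that make it a JSON object.
Second, the field name (you set it to "" in the schema, but never added it to the JSON).
Finally, the closing ].

Try replacing it with: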

val dfExample2= spark.sql("""select "{\"\":[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]}" as theJson""")

You will get:

scala> dfICanWorkWith.collect()
res12: Array[org.apache.spark.sql.Row] = Array([[WrappedArray([Johnny,Boy], [Franky,Man])]])
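
If you would rather not wrap the array in an object with an empty key, one alternative is to pass an ArrayType as the top-level schema, which from_json accepts through its (Column, DataType) overload in Spark 2.2 and later, as far as I know. The sketch below is only an illustration under that assumption: the names personSchema, arraySchema, dfArray and people are made up for the example, and the input string still needs its closing ].

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// In spark-shell this returns the existing session; standalone it starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("from_json-array-sketch").getOrCreate()
import spark.implicits._

// Schema of one array element.
val personSchema = new StructType()
  .add("FirstName", StringType)
  .add("Surname", StringType)

// The JSON root is an array, so the top-level schema is an ArrayType, not a StructType.
val arraySchema = ArrayType(personSchema)

// Well-formed JSON array (note the closing bracket).
val dfArray = Seq(
  """[{"FirstName":"Johnny","Surname":"Boy"},{"FirstName":"Franky","Surname":"Man"}]"""
).toDF("theJson")

// Parse the string, then turn each array element into its own row with plain columns.
val people = dfArray
  .select(explode(from_json(col("theJson"), arraySchema)).as("person"))
  .select("person.FirstName", "person.Surname")

people.show(false)
```

With the two-element input above, this should give one row per person with FirstName and Surname as ordinary string columns, which is usually easier to work with than a single nested column.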

From Spark 2.4 the schema_of_json function helps:

> SELECT schema_of_json('[{"col":0}]');
  array<struct<col:bigint>>

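In your case, you can then use the code below to parse that array of JSON objects:

scala> spark.sql("""select from_json("[{ \"FirstName\":\"Johnny\", \"Surname\":\"Boy\" }, { \"FirstName\":\"Franky\", \"Surname\":\"Man\" }]", 'array<struct<FirstName:string,Surname:string>>') as theJson""").show(false)
+------------------------------+
|theJson                       |
+------------------------------+
|[[Johnny, Boy], [Franky, Man]]|
+------------------------------+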
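
The same idea can be sketched with the DataFrame API instead of a SQL string. This assumes Spark 2.4 or later, where schema_of_json(String) and the from_json overload that takes the schema as a Column are available. The names sampleJson, dfInferred and parsedInferred are hypothetical, and the sample string only needs to be representative of the real data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, schema_of_json}

// In spark-shell this returns the existing session; standalone it starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("schema_of_json-sketch").getOrCreate()
import spark.implicits._

// A representative sample used only to infer the schema (assumed to match the real rows).
val sampleJson = """[{"FirstName":"Johnny","Surname":"Boy"}]"""

val dfInferred = Seq(
  """[{"FirstName":"Johnny","Surname":"Boy"},{"FirstName":"Franky","Surname":"Man"}]"""
).toDF("theJson")

// schema_of_json yields the schema as a column; from_json then parses each row with it.
val parsedInferred = dfInferred.select(
  from_json(col("theJson"), schema_of_json(sampleJson)).as("theJson")
)

parsedInferred.printSchema()
parsedInferred.show(false)
```

Inferring the schema from a sample saves typing, but it only covers the fields that appear in the sample, so an explicit schema may be the safer choice when the JSON varies between rows.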