Apache Spark: reading a JSON array string with a schema returns null (Spark 2.2.0)


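When I try to read a Spark DataFrame column that contains a JSON string as an array, it returns null if I use a defined schema. I tried Array, Seq, and List for the schema, but all of them return null. My Spark version is 2.2.0.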

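import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import spark.implicits._  // for the $"..." column syntax (already in scope in spark-shell)

val dfdata = spark.sql("""select "\[{ \"id\":\"93993\", \"name\":\"Phil\" }, { \"id\":\"838\", \"name\":\"Don\" }]" as theJson""")
dfdata.show(5, false)

// Schema describing a single struct with two string fields
val sch = StructType(
  Array(StructField("id", StringType, true),
        StructField("name", StringType, true)))
print(sch.prettyJson)
dfdata.select(from_json($"theJson", sch)).show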

And the output:

+---------------------------------------------------------------+
|theJson                                                        |
+---------------------------------------------------------------+
|[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]|
+---------------------------------------------------------------+

{
  "type" : "struct",
  "fields" : [ {
    "name" : "id",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  }, {
    "name" : "name",
    "type" : "string",
    "nullable" : true,
    "metadata" : { }
  } ]
}+----------------------+
|jsontostructs(theJson)|
+----------------------+
|                  null|
+----------------------+

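Your schema doesn't quite match your example. Your example is an array of structs, while the schema describes a single struct. Try wrapping it in an ArrayType: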

val sch = ArrayType(StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
)))
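To show how the array schema fits into the rest of the pipeline, here is a minimal sketch that parses the column and then splits it into `id` and `name` columns. It assumes Spark 2.2.0 or later, where `from_json` accepts a `DataType`. The local `SparkSession`, the app name, and the `person` alias are illustrative choices, not part of the original question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-array-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the question's DataFrame: one string column holding a JSON array.
val dfdata = Seq("""[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]""").toDF("theJson")

// The schema from the answer: an array whose elements are structs.
val sch = ArrayType(StructType(Array(
  StructField("id", StringType, true),
  StructField("name", StringType, true)
)))

// Parse the string into array<struct<id,name>>, then emit one row per array element.
val people = dfdata
  .select(explode(from_json($"theJson", sch)).as("person"))
  .select($"person.id", $"person.name")

people.show(false)
```

Without `explode`, the parsed column stays a single array value per row, so selecting `id` from it would yield an array of ids rather than one row per object.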

Have you tried parsing the JSON string before creating the DataFrame?

import spark.implicits._  // provides .toDS (already in scope in spark-shell)

// obtaining this string should be easy:
val jsonStr = """[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]"""

// then you can take advantage of schema inference
val df2 = spark.read.json(Seq(jsonStr).toDS)

df2.show(false)

// it shows:
// +-----+----+
// |id   |name|
// +-----+----+
// |93993|Phil|
// |838  |Don |
// +-----+----+
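If the JSON string already lives in a DataFrame column, the same schema inference can probably be applied by turning that column into a `Dataset[String]` first. The sketch below assumes Spark 2.2.0 or later, where `DataFrameReader.json` accepts a `Dataset[String]`. The local session and the name `inferred` are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-infer-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the question's DataFrame with a single JSON string column.
val dfdata = Seq("""[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]""").toDF("theJson")

// Convert the column to Dataset[String] and let Spark infer the schema from the data.
val inferred = spark.read.json(dfdata.select($"theJson").as[String])

inferred.printSchema()
inferred.show(false)
```

One trade-off of this approach is that the result contains only the fields found in the JSON, so any other columns of the source DataFrame are not carried along.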

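Try dfdata.select("theJson").show and you will get the data you want.

@AlexandrosBiratsis, that only gives the raw JSON string. What I want to do is read the JSON data and split it into separate columns such as id and name.

I think the link above may be the answer you are looking for :)

I tried newDF.select($"parsed.id", $"parsed.name").show(false), which gives:

+----+----+
|id  |name|
+----+----+
|null|null|
+----+----+

My input dataset is a CSV file whose columns hold JSON and other formats. I am not sure that reading the whole file as a JSON file would help meet my requirement.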
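For the CSV case raised in the comments, where the JSON string is only one of several columns, one possible approach is to parse just that column and keep the others. This is a sketch under assumptions: the column `rowKey` and its value are hypothetical stand-ins for the other CSV columns, and `parsed` mirrors the column name used in the comment.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Assumption: a local session for illustration; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("json-column-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in for a CSV-derived DataFrame: one ordinary column plus one JSON column.
val input = Seq(
  ("row-1", """[{ "id":"93993", "name":"Phil" }, { "id":"838", "name":"Don" }]""")
).toDF("rowKey", "theJson")

// Schema of a single array element.
val elemSchema = new StructType().add("id", StringType).add("name", StringType)

// Parse with an array schema, explode to one row per element, and keep the other column.
val flattened = input
  .withColumn("parsed", explode(from_json($"theJson", ArrayType(elemSchema))))
  .select($"rowKey", $"parsed.id", $"parsed.name")

flattened.show(false)
```

Each element of the array becomes its own row, and the value of `rowKey` is repeated on every row that came from the same input record.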