Arrays spark scala: converting an array-of-structs column to a string column
I have a column whose type is array (of structs), inferred from a JSON file. I want to convert the array to a string so I can keep this column in Hive and export it to an RDBMS as a single column.
temp.json
{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":
{"value":"296160"},"sku_id":
{"value":"312002"}}],"user_id":"6666","zip_code":"666"}}
Processing:
scala> val temp = spark.read.json("s3://check/1/temp1.json")
temp: org.apache.spark.sql.DataFrame = [properties: struct<items:
array<struct<invoicid:struct<value:string>,job_id:struct<value:string>,sku_id:struct<value:string>>>, user_id: string ... 1 more field>]
scala> temp.printSchema
root
|-- properties: struct (nullable = true)
| |-- items: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- invoicid: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| | | |-- job_id: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| | | |-- sku_id: struct (nullable = true)
| | | | |-- value: string (nullable = true)
| |-- user_id: string (nullable = true)
| |-- zip_code: string (nullable = true)
scala> temp.select("properties").show
+--------------------+
| properties|
+--------------------+
|[WrappedArray([[9...|
+--------------------+
scala> temp.select("properties.items").show
+--------------------+
| items|
+--------------------+
|[[[923659],[29616...|
+--------------------+
scala> temp.createOrReplaceTempView("tempTable")
scala> spark.sql("select properties.items from tempTable").show
+--------------------+
| items|
+--------------------+
|[[[923659],[29616...|
+--------------------+
to_json is the function you are looking for. It serializes the struct/array back to a JSON string, so you get the array element values out without any further changes:
import org.apache.spark.sql.functions.{get_json_object, to_json}
val df = spark.read.json(sc.parallelize(Seq("""
{"properties":{"items":[{"invoicid":{"value":"923659"},"job_id":
{"value":"296160"},"sku_id":
{"value":"312002"}}],"user_id":"6666","zip_code":"666"}}""")))
df
.select(get_json_object(to_json($"properties"), "$.items").alias("items"))
.show(false)
+-----------------------------------------------------------------------------------------+
|items                                                                                    |
+-----------------------------------------------------------------------------------------+
|[{"invoicid":{"value":"923659"},"job_id":{"value":"296160"},"sku_id":{"value":"312002"}}]|
+-----------------------------------------------------------------------------------------+
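For the original goal (keeping the array as a plain string column in Hive), to_json can also be applied directly to the array column itself, without the get_json_object round trip. A minimal sketch, assuming the same `temp` DataFrame from the session above (column names `items_str` etc. are illustrative, not from the original):

```scala
import org.apache.spark.sql.functions.to_json

// Replace the array-of-structs with its JSON string representation.
// In spark-shell the $ syntax is available; in a standalone app add
// `import spark.implicits._` first.
val flat = temp.select(
  $"properties.user_id",
  $"properties.zip_code",
  to_json($"properties.items").alias("items_str")  // hypothetical column name
)

flat.printSchema  // items_str is now a plain string column
// flat.write.saveAsTable(...)  // a string column can be stored in Hive / exported to an RDBMS
```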
How can I extract all the columns attached to the root struct? For example, if "properties" did not exist, I would expect select(get_json_object(to_json($"*"), "$.value")) to work, but it does not. Use to_json(struct(df.columns map col: _*)) instead.
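The suggestion above can be sketched as follows: struct packs every top-level column into a single struct column, which to_json can then serialize, so JSON paths work from the root of the row. A minimal sketch, assuming the same `df` as before:

```scala
import org.apache.spark.sql.functions.{col, get_json_object, struct, to_json}

// Pack all root-level columns into one struct, then serialize the whole row.
val asJson = df.select(to_json(struct(df.columns.map(col): _*)).alias("json"))

// The entire row is now one JSON string, and paths start from the root:
asJson
  .select(get_json_object($"json", "$.properties.items").alias("items"))
  .show(false)
```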