Convert a JSON string to an array of key-value pairs in Spark Scala

scala apache-spark

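I have a JSON string loaded into a Spark DataFrame. The JSON string can contain 0 to 3 key-value pairs.

When more than one key-value pair is sent, product_facets is correctly formatted as an array, like this: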

{"id":1,
  "productData":{
  "product":{
  "product_name":"xyz",
  "product_facets":{"entry":[{"key":"test","value":"success"}, {"key": "test2","value" : "fail"}]}
 }}}

In that case I can use the explode function:

sourceDF.filter($"someKey".contains("some_string"))
  .select($"id", explode($"productData.product.product_facets.entry") as "kvPairs")

However, when only one key-value pair is sent, entry in the source JSON string is not formatted as an array with square brackets:

{"id":1,
  "productData":{
  "product":{
  "product_name":"xyz",
  "product_facets":{"entry":{"key":"test","value":"success"}}
 }}}

The schema of the product field looks like this:

|    |-- product: struct (nullable = true)
|    |    |-- product_facets: struct (nullable = true)
|    |    |    |-- entry: string (nullable = true)
|    |    |-- product_name: string (nullable = true)

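How can I change entry into an array of key-value pairs that is compatible with the explode function? My end goal is to pivot the keys into separate columns, and I want to use a group by after exploding the kv pairs. I tried using from_json, but could not get it to work:

val schema =
  StructType(
    Seq(
      StructField("entry", ArrayType(
        StructType(
          Seq(
            StructField("key", StringType),
            StructField("value", StringType)
          )
        )
      ))
    )
  )

sourceDF.filter($"someKey".contains("some_string"))
  .select($"id", from_json($"productData.product.product_facets.entry", schema) as "kvPairsFromJson")

The code above does create a new column kvPairsFromJson, but it looks like "[]", and calling explode on it does nothing.

Any suggestions on what is going on, or on whether there is a better way to do this?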
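
One thing that stands out in the attempt above: the schema describes an object with an entry field, while the column passed to from_json already holds the value of entry (either a bare object or an array). The sketch below is only a guess at a workaround, not a confirmed diagnosis. It assumes Spark 2.2 or later (for the from_json overload that takes a DataType), assumes entry was inferred as a string column as in the schema shown, and uses made-up names (FromJsonSketch, kvSchema, entryAsArray) and inline sample rows.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object FromJsonSketch {
  // Schema of ONE key-value pair, matching what the string column actually contains
  val kvSchema: StructType = StructType(Seq(
    StructField("key", StringType),
    StructField("value", StringType)
  ))

  // Hypothetical helper: parse the string as an array when it starts with "[",
  // otherwise parse it as a single object and wrap that object in an array
  def entryAsArray(entry: Column): Column =
    when(trim(entry).startsWith("["), from_json(entry, ArrayType(kvSchema)))
      .otherwise(array(from_json(entry, kvSchema)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("from-json-sketch").getOrCreate()
    import spark.implicits._

    // Assumed sample data: the entry column holds raw JSON text in both shapes
    val sourceDF = Seq(
      (1L, """[{"key":"test","value":"success"},{"key":"test2","value":"fail"}]"""),
      (2L, """{"key":"test","value":"success"}""")
    ).toDF("id", "entry")

    sourceDF
      .select($"id", explode(entryAsArray($"entry")) as "kvPairs")
      .select($"id", $"kvPairs.key", $"kvPairs.value")
      .show(false)

    spark.stop()
  }
}
```

Note that a null entry falls into the otherwise branch here and yields a one-element array containing null, so a separate null check may be needed depending on the data.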
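
I think one approach could be:

1. Create a udf that takes the entry value as a JSON string and converts it to a List(Tuple(K, V)).

2. In the udf, check whether the entry value is an array and convert it accordingly.

The code below illustrates this approach: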

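// These imports are already in scope in spark-shell; a standalone application needs them
import org.apache.spark.sql.functions.{col, explode, udf}
import spark.implicits._

// one row where entry is an array and one where it is not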
val ds = Seq("""{"id":1,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":[{"key":"test","value":"success"},{"key":"test2","value":"fail"}]}}}}""", """{"id":2,"productData":{"product":{"product_name":"xyz","product_facets":{"entry":{"key":"test","value":"success"}}}}}""").toDS

val df = spark.read.json(ds)

// Schema used by udf to generate output column    
import org.apache.spark.sql.types._
val outputSchema = ArrayType(StructType(Seq(
  StructField("key", StringType, false),
  StructField("value", StringType, false)
)))

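// Converts a non-array entry value to an array.
// Note: this untyped udf(f, dataType) form is deprecated in Spark 3.x and is
// rejected there unless spark.sql.legacy.allowUntypedScalaUDF is set to true.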
val toArray = udf((json: String) => {

  import com.fasterxml.jackson.databind._
  import com.fasterxml.jackson.module.scala.DefaultScalaModule

  val jsonMapper = new ObjectMapper()
  jsonMapper.registerModule(DefaultScalaModule)

  if(!json.startsWith("[")) {
    val jsonMap = jsonMapper.readValue(json, classOf[Map[String, String]])
    List((jsonMap("key"), jsonMap("value")))
  } else {
    jsonMapper.readValue(json, classOf[List[Map[String, String]]]).map(f => (f("key"), f("value")))
  } 

}, outputSchema)

val arrayResult = df.select(col("id").as("id"), toArray(col("productData.product.product_facets.entry")).as("entry"))

val arrayExploded = df.select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))

val explodedToCols = df
  .select(col("id").as("id"), explode(toArray(col("productData.product.product_facets.entry"))).as("entry"))
  .select(col("id"), col("entry.key").as("key"), col("entry.value").as("value"))

Result:

scala> arrayResult.printSchema
root
 |-- id: long (nullable = true)
 |-- entry: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- key: string (nullable = false)
 |    |    |-- value: string (nullable = false)


scala> arrayExploded.printSchema
root
 |-- id: long (nullable = true)
 |-- entry: struct (nullable = true)
 |    |-- key: string (nullable = false)
 |    |-- value: string (nullable = false)

scala> arrayResult.show(false)
+---+--------------------------------+
|id |entry                           |
+---+--------------------------------+
|1  |[[test, success], [test2, fail]]|
|2  |[[test, success]]               |
+---+--------------------------------+

scala> arrayExploded.show(false)
+---+---------------+
|id |entry          |
+---+---------------+
|1  |[test, success]|
|1  |[test2, fail]  |
|2  |[test, success]|
+---+---------------+
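
The question's end goal was to pivot the keys into separate columns. Starting from a long id/key/value table such as explodedToCols above, a minimal sketch could group by id and pivot on key. It assumes each key occurs at most once per id (first simply picks one value otherwise). The sample rows are copied from the output above, and the names PivotSketch and pivotKeys are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.first

object PivotSketch {
  // Hypothetical helper: one output column per distinct key;
  // ids that lack a given key get null in that column
  def pivotKeys(longDF: DataFrame): DataFrame =
    longDF.groupBy("id").pivot("key").agg(first("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
    import spark.implicits._

    // Same shape as explodedToCols: one row per (id, key, value)
    val longDF = Seq(
      (1L, "test", "success"),
      (1L, "test2", "fail"),
      (2L, "test", "success")
    ).toDF("id", "key", "value")

    pivotKeys(longDF).orderBy("id").show(false)

    spark.stop()
  }
}
```

If the set of keys is known in advance, passing it explicitly, for example pivot("key", Seq("test", "test2")), spares Spark the extra pass it otherwise makes to collect the distinct keys.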

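You have two kinds of data: one where product_facets is an array and one where it is a string. Am I right? You are trying to load both and handle them as a field of a single type (product_facets). Is that so?

@nir hedvat Yes, that's right. One is an array and the other is a string, and I want to treat both as arrays so that I can use the explode function.

This is not feasible with a simple SQL query, because Spark cannot process data with multiple schemas (schema on read/write). You should use a UDF for this. Take a look at this. Just pass in the field that holds the data and always return an array for it.

Thanks for writing the code snippet. It was really helpful. Before I accept this answer: I think we need to handle the case where the input string json can be null. In some cases product_facets is an empty string. How should the udf handle that?

I found the answer. If product_facets is an empty string, the udf needs a small modification to return an empty array.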
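
The last two comments say the udf must return an empty array when product_facets is empty, but the modified code is not shown. As a guess at what such a guard might look like, the sketch below normalizes the raw string before any JSON parsing happens. It is plain Scala with no Spark or Jackson dependency. The names EntryNormalizer and normalizeEntry are hypothetical, and treating null the same as an empty string is an assumption.

```scala
object EntryNormalizer {
  // Returns a JSON array string for every input:
  //   null or blank  -> "[]"
  //   "[...]"        -> returned as-is after trimming (already an array)
  //   anything else  -> wrapped in brackets, so a single object becomes a one-element array
  def normalizeEntry(json: String): String = {
    val trimmed = Option(json).map(_.trim).getOrElse("")
    if (trimmed.isEmpty) "[]"
    else if (trimmed.startsWith("[")) trimmed
    else s"[$trimmed]"
  }

  def main(args: Array[String]): Unit = {
    println(normalizeEntry(null))
    println(normalizeEntry(""))
    println(normalizeEntry("""{"key":"test","value":"success"}"""))
    println(normalizeEntry("""[{"key":"test","value":"success"}]"""))
  }
}
```

Inside toArray, the normalized string could then always be parsed with the array branch, which would make the startsWith check in the udf unnecessary and avoid a NullPointerException on null input.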