SPARK: How to parse an array of JSON objects using Spark


I have a file with ordinary columns plus one column containing a JSON string, as shown below (a picture was also attached). Each such row actually belongs to a column named Demo (not visible in the pic). The other columns were dropped and are not visible in the pic because they are not relevant here.

[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]
Please do not change the format of the JSON: it appears in the data file exactly as above, except that everything is on a single line.

Each row has one such object under a column, say JSON. The objects are all on one line, but each sits inside an array. I want to parse this column using Spark and access the value of every object inside. Please help.

What I want is to get the value of the key "value". My goal is to extract the value of the "value" key of each JSON object into separate columns.

I tried using get_json_object. It works for JSON string (1) below, but returns null for JSON string (2):

  • {"key":"device_kind","value":"desktop"}
  • [{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]

The code I tried is below:

    val jsonDF1 = spark.range(1).selectExpr(""" '{"key":"device_kind","value":"desktop"}' as jsonString""")
    
jsonDF1.select(get_json_object(col("jsonString"), "$.value") as "device_kind").show(2) // prints desktop under the column device_kind
    
    val jsonDF2 = spark.range(1).selectExpr(""" '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]' as jsonString""")
    
jsonDF2.select(get_json_object(col("jsonString"), "$.[0].value") as "device_kind").show(2) // prints null, but desktop was expected under the column device_kind
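
As a side note, Spark's JsonPath dialect appears to index an array root without the extra dot, i.e. "$[0].value" rather than "$.[0].value"; a minimal sketch of the same query with that path form, included for comparison:

    // Same array string as jsonDF2, but indexing the root array as "$[0]"
    jsonDF2.select(get_json_object(col("jsonString"), "$[0].value") as "device_kind").show(2) // with this path form, desktop is the expected result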
    
Next, I wanted to use from_json, but I don't know how to build the schema for an array of JSON objects. All the examples I found are for nested JSON objects, but nothing quite like the JSON string above.

I did find that in SparkR 2.2 from_json has a Boolean parameter which, if set to true, handles the above kind of JSON string, i.e. an array of JSON objects, but that option is not available in Spark Scala 2.3.3.

To make the input and expected output clear, they are as below.

i/p is as below:

    +------------------------------------------------------------------------+
    |Demographics                                                            |
    +------------------------------------------------------------------------+
    |[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|
    |[[device_kind, mobile], [country_code, BE], [device_platform, android]] |
    |[[device_kind, mobile], [country_code, QA], [device_platform, android]] |
    +------------------------------------------------------------------------+
    
Expected o/p is as below:

    +------------------------------------------------------------------------+-----------+------------+---------------+
    |Demographics                                                            |device_kind|country_code|device_platform|
    +------------------------------------------------------------------------+-----------+------------+---------------+
    |[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop    |ID          |windows        |
    |[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile     |BE          |android        |
    |[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile     |QA          |android        |
    +------------------------------------------------------------------------+-----------+------------+---------------+
    

If the column holding the JSON looks like this:

        import spark.implicits._
    
        val inputDF = Seq(
          ("""[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]"""),
          ("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}]"""),
          ("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}]""")
        ).toDF("Demographics")
    
    inputDF.show(false)
    +-------------------------------------------------------------------------------------------------------------------------+
    |Demographics                                                                                                             |
    +-------------------------------------------------------------------------------------------------------------------------+
    |[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]|
    |[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}] |
    |[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}] |
    +-------------------------------------------------------------------------------------------------------------------------+
    
You can try to parse the column as follows:

      // Parse the JSON string into an array<struct<key,value>> column
      val parsedJson: DataFrame = inputDF.selectExpr("Demographics", "from_json(Demographics, 'array<struct<key:string,value:string>>') as parsed_json")

      // Pick each {key, value} struct out of the array by position
      val splitted = parsedJson.select(
        col("parsed_json").as("Demographics"),
        col("parsed_json").getItem(0).as("device_kind_json"),
        col("parsed_json").getItem(1).as("country_code_json"),
        col("parsed_json").getItem(2).as("device_platform_json")
      )

      // Keep only the "value" field of each struct as the final column
      val result = splitted.select(
        col("Demographics"),
        col("device_kind_json.value").as("device_kind"),
        col("country_code_json.value").as("country_code"),
        col("device_platform_json.value").as("device_platform")
      )

      result.show(false)
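
If the order of the {key, value} objects inside the array is not guaranteed, a position-independent variant is possible; a sketch (assuming the same imports from org.apache.spark.sql.functions._ and Spark 2.3+, using explode plus pivot) could look like:

      // Explode the parsed array, then pivot on the key so that the position
      // of each object inside the JSON array no longer matters
      val byKey = parsedJson
        .select(col("Demographics"), explode(col("parsed_json")).as("kv"))
        .select(col("Demographics"), col("kv.key").as("key"), col("kv.value").as("value"))
        .groupBy("Demographics")
        .pivot("key", Seq("device_kind", "country_code", "device_platform"))
        .agg(first("value"))

      byKey.show(false)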
    

Thank you for the answer, it works well. I used a slightly different approach to solve this, since I am using Spark 2.3.3:

    import org.apache.spark.sql.functions._
    import org.apache.spark.sql.types._

    // Schema for the array of {key, value} objects
    val sch = ArrayType(StructType(Array(
      StructField("key", StringType, true),
      StructField("value", StringType, true)
    )))

    // mdf is the input DataFrame whose raw JSON column is named jsonString
    val jsonDF3 = mdf.select(from_json(col("jsonString"), sch).alias("Demographics"))

    // Pull the "value" field of each struct out by position
    val jsonDF4 = jsonDF3.withColumn("device_kind", expr("Demographics[0].value"))
      .withColumn("country_code", expr("Demographics[1].value"))
      .withColumn("device_platform", expr("Demographics[2].value"))
    

What is the output you want? Maybe this will help: what do you mean by "parse and access the value of each object"? A bit confused.

Hi Aleh, in that transformed form it is hard to query, for a given device kind (e.g. desktop), the related country_code and device_platform. I want to form the columns device_kind, country_code and device_platform, with the corresponding values for each row.

Hi BishamonTen. I have edited the answer. Check whether the solution matches what you need.

Hi Aleh, I wonder whether you have some experience with unit testing and integration testing of Spark code/applications. I have some code to unit test. Could you help me?

Hi Bishamon. If you share more information, I can say what help I can offer.

I have posted my question at this link, please provide your valuable input.

What is the mdf you mention above?
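
Since the thread touches on unit testing Spark code, here is a minimal sketch of how such a test might be structured (assuming ScalaTest 3.1+ on the classpath and a local SparkSession; the class and test names are illustrative, not from the thread):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col
    import org.scalatest.funsuite.AnyFunSuite

    // Illustrative test class; the names here are hypothetical
    class DemographicsParserSpec extends AnyFunSuite {

      private lazy val spark = SparkSession.builder()
        .master("local[1]")
        .appName("demographics-parser-test")
        .getOrCreate()

      test("extracts the value field of each JSON object in the array") {
        import spark.implicits._

        val input = Seq(
          """[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]"""
        ).toDF("Demographics")

        val parsed = input.selectExpr(
          "from_json(Demographics, 'array<struct<key:string,value:string>>') as parsed_json")

        val firstRow = parsed.select(col("parsed_json").getItem(0).getField("value")).head()
        assert(firstRow.getString(0) == "desktop")
      }
    }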