SPARK: How to parse an array of JSON objects using Spark
I have a file with some ordinary columns and one column that holds a JSON string, as shown below (a picture was attached in the original post). Each such row actually belongs to a column named Demo (not visible in the pic). The other columns were dropped and are not visible in the pic because they don't need to be considered right now:
[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]
Please do not change the format of the JSON; it appears in the data file exactly as above, but with everything on a single line.
Each row has one such object under a column, say JSON. The objects are all on one line, but inside an array. I want to parse this column using Spark and access the value of every object inside it. Please help.
What I want is to get the value of the key "value". My goal is to extract the value of the "value" key of each JSON object into a separate column.
I tried get_json_object. It works for JSON string (1) below, but returns null for JSON string (2):
val jsonDF1 = spark.range(1).selectExpr(""" '{"key":"device_kind","value":"desktop"}' as jsonString""")
jsonDF1.select(get_json_object(col("jsonString"), "$.value") as "device_kind").show(2)// prints desktop under column named device_kind
val jsonDF2 = spark.range(1).selectExpr(""" '[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]' as jsonString""")
jsonDF2.select(get_json_object(col("jsonString"), "$.[0].value") as "device_kind").show(2)// prints null, but desktop was expected under column named device_kind
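As an aside (an assumption worth verifying on your Spark version): Spark's JsonPath dialect indexes arrays as `$[0]`, without a dot before the bracket, so the null above may come from the `$.[0]` path rather than from the array itself. A minimal sketch reusing jsonDF2:

```scala
// Sketch: same DataFrame as jsonDF2, but with the array index written
// as "$[0]" instead of "$.[0]"; the dot-before-bracket form is not
// part of the JsonPath subset that get_json_object accepts.
jsonDF2.select(
  get_json_object(col("jsonString"), "$[0].value") as "device_kind"
).show() // should show desktop if the path is accepted
```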
Next I wanted to use from_json, but I could not figure out how to build the schema for an array of JSON objects. All the examples I found are for nested JSON objects, which are quite different from the JSON string above.
I did find that in SparkR 2.2, from_json has a boolean argument which, if set to true, handles JSON strings of the above kind, i.e. an array of JSON objects, but that option is not available in Spark Scala 2.3.3.
To make the input and expected output clear, they should be as follows.
i/p is below:
+------------------------------------------------------------------------+
|Demographics |
+------------------------------------------------------------------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |
+------------------------------------------------------------------------+
Expected o/p is below:
+------------------------------------------------------------------------+-----------+------------+---------------+
|Demographics |device_kind|country_code|device_platform|
+------------------------------------------------------------------------+-----------+------------+---------------+
|[[device_kind, desktop], [country_code, ID], [device_platform, windows]]|desktop |ID |windows |
|[[device_kind, mobile], [country_code, BE], [device_platform, android]] |mobile |BE |android |
|[[device_kind, mobile], [country_code, QA], [device_platform, android]] |mobile |QA |android |
+------------------------------------------------------------------------+-----------+------------+---------------+
If the column holding the JSON looks like this:
import spark.implicits._
val inputDF = Seq(
("""[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]"""),
("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}]"""),
("""[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}]""")
).toDF("Demographics")
inputDF.show(false)
+-------------------------------------------------------------------------------------------------------------------------+
|Demographics |
+-------------------------------------------------------------------------------------------------------------------------+
|[{"key":"device_kind","value":"desktop"},{"key":"country_code","value":"ID"},{"key":"device_platform","value":"windows"}]|
|[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"BE"},{"key":"device_platform","value":"android"}] |
|[{"key":"device_kind","value":"mobile"},{"key":"country_code","value":"QA"},{"key":"device_platform","value":"android"}] |
+-------------------------------------------------------------------------------------------------------------------------+
You can try parsing that column as follows:
val parsedJson: DataFrame = inputDF.selectExpr("Demographics", "from_json(Demographics, 'array<struct<key:string,value:string>>') as parsed_json")
val splitted = parsedJson.select(
col("parsed_json").as("Demographics"),
col("parsed_json").getItem(0).as("device_kind_json"),
col("parsed_json").getItem(1).as("country_code_json"),
col("parsed_json").getItem(2).as("device_platform_json")
)
val result = splitted.select(
col("Demographics"),
col("device_kind_json.value").as("device_kind"),
col("country_code_json.value").as("country_code"),
col("device_platform_json.value").as("device_platform")
)
result.show(false)
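One caveat with `getItem(0/1/2)`: it relies on the keys appearing in the same order in every row. If you are on Spark 2.4 or later (an assumption; this function is not available in 2.3.3), `map_from_entries` turns the parsed array of key/value structs into a map, so each value is looked up by key name instead of by position. A sketch using the same inputDF:

```scala
// Sketch (Spark 2.4+): build a map from the key/value structs and
// index it by key name rather than by array position.
val byKey = inputDF
  .selectExpr(
    "Demographics",
    "map_from_entries(from_json(Demographics, 'array<struct<key:string,value:string>>')) as kv")
  .selectExpr(
    "Demographics",
    "kv['device_kind'] as device_kind",
    "kv['country_code'] as country_code",
    "kv['device_platform'] as device_platform")
byKey.show(false)
```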
Thank you for your answer, it works well. I solved it in a slightly different way, since I am on Spark 2.3.3:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Schema for the array of {key, value} objects
val sch = ArrayType(StructType(Array(
  StructField("key", StringType, true),
  StructField("value", StringType, true)
)))
// mdf is the input DataFrame, with the JSON column named "jsonString"
val jsonDF3 = mdf.select(from_json(col("jsonString"), sch).alias("Demographics"))
val jsonDF4 = jsonDF3.withColumn("device_kind", expr("Demographics[0].value"))
.withColumn("country_code", expr("Demographics[1].value"))
.withColumn("device_platform", expr("Demographics[2].value"))
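Since this also indexes the array by position, it can silently pick the wrong value if the key order ever differs between rows. A key-based alternative that stays within Spark 2.3.3 (a sketch reusing jsonDF3 from above; the id column exists only to group the exploded rows back together) is to explode the array and pivot on the key:

```scala
import org.apache.spark.sql.functions._

// Explode the array of {key, value} structs into one row per entry,
// then pivot the keys back into columns, robust to key ordering.
val exploded = jsonDF3
  .withColumn("id", monotonically_increasing_id())
  .withColumn("kv", explode(col("Demographics")))
val pivoted = exploded
  .groupBy("id")
  .pivot("kv.key", Seq("device_kind", "country_code", "device_platform"))
  .agg(first("kv.value"))
pivoted.show(false)
```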
What output do you want? Maybe this will help: what do you mean by parsing and accessing the value of each object? A bit confused.
Hi Aleh, with this way of transforming it is hard to query, for a given device kind (such as desktop), the related country_code and device_platform. I want to form the columns device_kind, country_code and device_platform, with the corresponding value for each row.
Hi BishamonTen. I have edited the answer. Check whether the solution fits your needs.
Hi Aleh, I wonder whether you have some experience with unit testing and integration testing of Spark code/applications. I have some code to unit test. Could you help me?
Hi BishamonTen. If you share more information, I can tell you what help I can offer.
I have posted my question at this link, please provide your valuable input. Also, may I ask what the mdf you mentioned above is?