Splitting an array containing nested arrays using Scala in Azure Databricks
I'm currently working on a project where I have to extract some horribly nested data from a JSON document (the output of a LogAnalytics REST API call; the real document has more columns than shown here). I've successfully read this JSON document into a DataFrame, but I'm struggling with what to do next. The output I'm trying to achieve looks like this:
[{
    "Category": "Administrative",
    "count_": 20839
},
{
    "Category": "Recommendation",
    "count_": 122
},
{
    "Category": "Alert",
    "count_": 64
},
{
    "Category": "ServiceHealth",
    "count_": 11
}]
Ideally, I'd like to use my columns array as the headers for each record, and then split each record array inside the parent rows array out into its own record.
So far I've tried flattening the originally imported DataFrame, but that doesn't work because the row data is an array of arrays.
How can I solve this puzzle?

It's a bit fiddly to deal with, but here's one approach:
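For reference, here is the input shape the answers below assume. The question's original sample did not survive, so this is inferred from the field names used in the answer code (tables, columns, rows); it matches the typical shape of a LogAnalytics query response:

```json
{
  "tables": [
    {
      "name": "PrimaryResult",
      "columns": [
        { "name": "Category", "type": "string" },
        { "name": "count_",   "type": "long" }
      ],
      "rows": [
        ["Administrative", 20839],
        ["Recommendation", 122],
        ["Alert", 64],
        ["ServiceHealth", 11]
      ]
    }
  ]
}
```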
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("multiline", true).json("filepath")

val result = df
  .select(explode($"tables").as("tables"))                               // one row per table
  .select($"tables.columns".as("col"), explode($"tables.rows").as("row")) // one row per data row
  .selectExpr("inline(arrays_zip(col, row))")                            // pair each column descriptor with its value
  .groupBy()
  .pivot($"col.name")                                                    // column names become DataFrame columns
  .agg(collect_list($"row"))
  .selectExpr("inline(arrays_zip(Category, count_))")                    // re-expand into one row per record
result.show
+--------------+------+
| Category|count_|
+--------------+------+
|Administrative| 20839|
|Recommendation| 122|
| Alert| 64|
| ServiceHealth| 11|
+--------------+------+
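To see what the arrays_zip + inline step above is doing, here is a plain-Scala analogue on ordinary collections (illustrative only, no Spark; the cols and row values are made up to match the example data). arrays_zip pairs each column descriptor with the value at the same position, and inline then flattens the resulting array of structs into separate output rows:

```scala
// Plain-Scala sketch of arrays_zip + inline applied to a single row:
val cols = Seq("Category", "count_")       // from tables.columns[*].name
val row  = Seq("Administrative", "20839")  // one entry of tables.rows

// arrays_zip: pair each column name with the value in the same position
val zipped: Seq[(String, String)] = cols.zip(row)

// inline: one output row per (name, value) struct
zipped.foreach { case (name, value) => println(s"$name -> $value") }
```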
To get JSON output, you can do the following:
val result_json = result.agg(to_json(collect_list(struct("Category", "count_"))).as("json"))
result_json.show(false)
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|json |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|[{"Category":"Administrative","count_":"20839"},{"Category":"Recommendation","count_":"122"},{"Category":"Alert","count_":"64"},{"Category":"ServiceHealth","count_":"11"}]|
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Alternatively, you can save it as JSON, for example with:
result.write.json("output")
Another approach uses the transform function:
import org.apache.spark.sql.functions._
import spark.implicits._

val df = spark.read.option("multiline", true).json(inPath)

val df1 = df.withColumn("tables", explode($"tables"))
  .select($"tables.rows".as("rows"))
  // turn each [category, count] pair into a named struct, then inline into rows
  .select(expr("inline(transform(rows, x -> struct(x[0] as Category, x[1] as _count)))"))
df1.show
//+--------------+------+
//| Category|_count|
//+--------------+------+
//|Administrative| 20839|
//|Recommendation| 122|
//| Alert| 64|
//| ServiceHealth| 11|
//+--------------+------+
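The transform(rows, x -> struct(...)) lambda does positional field naming: each inner array x becomes a struct whose first element is Category and whose second is _count. A plain-Scala analogue on ordinary collections (illustrative only, no Spark; the Record case class stands in for the SQL struct):

```scala
// Plain-Scala analogue of transform(rows, x -> struct(x[0] as Category, x[1] as _count)):
case class Record(Category: String, _count: Long)

val rows: Seq[Seq[Any]] = Seq(
  Seq("Administrative", 20839L),
  Seq("Recommendation", 122L)
)

// map plays the role of transform; the case class plays the role of the struct
val records: Seq[Record] =
  rows.map(x => Record(x(0).toString, x(1).asInstanceOf[Long]))
```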
Then save to a JSON file:
df1.write.json(outPath)