Scala dataframe with nested aggregation
I have a json file like below:
{"name":"jonh", "food":"tomato", "weight": 1}
{"name":"jonh", "food":"carrot", "weight": 4}
{"name":"bill", "food":"apple", "weight": 1}
{"name":"john", "food":"tomato", "weight": 2}
{"name":"bill", "food":"taco", "weight": 2}
{"name":"bill", "food":"taco", "weight": 4}
I need to create a new json like below:
{"name":"jonh", "buy": [{"tomato": 3},{"carrot": 4}]}
{"name":"bill", "buy": [{"apple": 1},{"taco": 6}]}
Here is my dataframe:
// assumes an active SparkSession `spark` and `import spark.implicits._` for .toDF
val df = Seq(
  ("john", "tomato", 1),
  ("john", "carrot", 4),
  ("bill", "apple", 1),
  ("john", "tomato", 2),
  ("bill", "taco", 2),
  ("bill", "taco", 4)
).toDF("name", "food", "weight")
How can I get a dataframe with the final structure? `groupBy` and `agg` give me the wrong structure:
import org.apache.spark.sql.functions._
df.groupBy("name", "food").agg(sum("weight").as("weight"))
.groupBy("name").agg(collect_list(struct("food", "weight")).as("acc"))
+----+------------------------+
|name|acc |
+----+------------------------+
|john|[[carrot,4], [tomato,3]]|
|bill|[[taco,6], [apple,1]] |
+----+------------------------+
{"name":"john","acc":[{"food":"carrot","weight":4},{"food":"tomato","weight":3}]}
{"name":"bill","acc":[{"food":"taco","weight":6},{"food":"apple","weight":1}]}
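The aggregation above is already computing the right numbers (the sum of `weight` per `(name, food)`); only the output shape differs from the target. As a plain-Scala sketch of what the two groupBy passes do (no Spark; data inlined from the question, the names `summed` and `byName` are purely illustrative):

```scala
// sum weight per (name, food), then collect per name into a food -> weight map
val data = Seq(
  ("john", "tomato", 1), ("john", "carrot", 4), ("bill", "apple", 1),
  ("john", "tomato", 2), ("bill", "taco", 2), ("bill", "taco", 4)
)

// first pass: (name, food) -> summed weight
val summed = data
  .groupBy { case (name, food, _) => (name, food) }
  .map { case ((name, food), rows) => (name, food, rows.map(_._3).sum) }

// second pass: name -> Map(food -> weight)
val byName = summed
  .groupBy(_._1)
  .map { case (name, rows) => name -> rows.map(r => r._2 -> r._3).toMap }
// byName: Map("john" -> Map("tomato" -> 3, "carrot" -> 4),
//             "bill" -> Map("apple" -> 1, "taco" -> 6))
```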
Please show me the right way to do this.

---

You can do this by iterating over the `Row`s, assembling food-weight pairs, and then converting them to a `Map`:
import org.apache.spark.sql.Row
// requires `import spark.implicits._` for the typed map and .toDF below

val step1 = df.groupBy("name", "food").agg(sum("weight").as("weight")).
  groupBy("name").agg(collect_list(struct("food", "weight")).as("buy"))

val result = step1.map(row =>
  (row.getAs[String]("name"), row.getAs[Seq[Row]]("buy").map(pair =>
    pair.getAs[String]("food") -> pair.getAs[Long]("weight")).toMap)
).toDF("name", "buy")

result.toJSON.show(false)
+---------------------------------------------+
|value                                        |
+---------------------------------------------+
|{"name":"john","buy":{"carrot":4,"tomato":3}}|
|{"name":"bill","buy":{"taco":6,"apple":1}}   |
+---------------------------------------------+
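As an aside: on Spark 2.4 or later the typed `map` step can be skipped entirely, because the built-in `map_from_entries` function converts the collected array of (food, weight) structs directly into a map column. A sketch under that version assumption, reusing `df` from the question:

```scala
import org.apache.spark.sql.functions._

// Spark 2.4+: build the map column without leaving the DataFrame API
val result = df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name")
  .agg(map_from_entries(collect_list(struct("food", "weight"))).as("buy"))

result.toJSON.show(false)
```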
You can also get the required json format with a string-replacement trick.

udf way

A `udf` operates on primitive data types, so a `replace` on each serialized JSON string can strip the `"food":` and `"weight":` keys from the final dataframe:
import org.apache.spark.sql.functions._

def replaceUdf = udf((json: String) => json.replace("\"food\":", "").replace("\"weight\":", ""))

val temp = df.groupBy("name", "food").agg(sum("weight").as("weight"))
  .groupBy("name").agg(collect_list(struct(col("food"), col("weight"))).as("buy"))
  .toJSON.withColumn("value", replaceUdf(col("value")))
+-------------------------------------------------+
|value |
+-------------------------------------------------+
|{"name":"john","buy":[{"carrot",4},{"tomato",3}]}|
|{"name":"bill","buy":[{"taco",6},{"apple",1}]} |
+-------------------------------------------------+
regexp_replace way

The built-in `regexp_replace` function can be used the same way to get the desired output:
val temp = df.groupBy("name", "food").agg(sum("weight").as("weight"))
.groupBy("name").agg(collect_list(struct(col("food"), col("weight"))).as("buy"))
.toJSON.withColumn("value", regexp_replace(regexp_replace(col("value"), "\"food\":", ""), "\"weight\":", ""))
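Worth noting: both the udf and `regexp_replace` variants do plain text surgery on the serialized JSON, so the result (e.g. `{"carrot",4}`) is no longer valid JSON, and the same replacement would also mangle any data that happened to contain the literal text `"food":` inside a value. A plain-Scala check of what the replacement actually produces on one serialized row:

```scala
// the same string surgery the udf performs, applied to one serialized row
val json = """{"name":"john","buy":[{"food":"carrot","weight":4},{"food":"tomato","weight":3}]}"""
val stripped = json.replace("\"food\":", "").replace("\"weight\":", "")
// stripped == {"name":"john","buy":[{"carrot",4},{"tomato",3}]}
// matches the shown output, but {"carrot",4} is not a valid JSON object
```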
Looks great. I had thought about a map but couldn't quite see how it would work. I'll try it and update. Thanks!