Scala Spark: update values in a second dataset based on values in the first dataset
I have two Spark datasets. The first has accountid and key columns, where key is an array in the format [key1, key2, key3, ...]. The second has two columns, accountid and a key-value column in JSON format: accountid, {key:value, key:value, ...}. If a key appears for an accountid in the first dataset, I need to update the value for that key in the second dataset.
import org.apache.spark.sql.functions._
val df= sc.parallelize(Seq(("20180610114049", "id1","key1"),
("20180610114049", "id2","key2"),
("20180610114049", "id1","key1"),
("20180612114049", "id2","key1"),
("20180613114049", "id3","key2"),
("20180613114049", "id3","key3")
)).toDF("date","accountid", "key")
val gp=df.groupBy("accountid","date").agg(collect_list("key"))
+---------+--------------+-----------------+
|accountid| date|collect_list(key)|
+---------+--------------+-----------------+
| id2|20180610114049| [key2]|
| id1|20180610114049| [key1, key1]|
| id3|20180613114049| [key2, key3]|
| id2|20180612114049| [key1]|
+---------+--------------+-----------------+
val df2= sc.parallelize(Seq(("20180610114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180610114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180611114049", "id1","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180612114049", "id2","{'key1':'0.0','key2':'0.0','key3':'0.0'}"),
("20180613114049", "id3","{'key1':'0.0','key2':'0.0','key3':'0.0'}")
)).toDF("date","accountid", "result")
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
+--------------+---------+----------------------------------------+
Expected output
+--------------+---------+----------------------------------------+
|date |accountid|result |
+--------------+---------+----------------------------------------+
|20180610114049|id1 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180610114049|id2 |{'key1':'0.0','key2':'1.0','key3':'0.0'}|
|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|
|20180612114049|id2 |{'key1':'1.0','key2':'0.0','key3':'0.0'}|
|20180613114049|id3 |{'key1':'0.0','key2':'1.0','key3':'1.0'}|
+--------------+---------+----------------------------------------+
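After joining the two dataframes, you can use a udf function to achieve your requirement. A few extra steps are involved, such as converting the JSON to a struct, converting the struct back to JSON, and using a case class (comments are provided for further explanation).
Here, result is a case class: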
case class result(key1: String, key2: String, key3: String)
which should give you
+---------+--------------+----------------------------------------+
|accountid|date |result |
+---------+--------------+----------------------------------------+
|id3 |20180613114049|{"key1":"0.0","key2":"1.0","key3":"1.0"}|
|id1 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id1 |20180611114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180610114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"0.0","key2":"1.0","key3":"0.0"}|
|id2 |20180612114049|{"key1":"1.0","key2":"0.0","key3":"0.0"}|
+---------+--------------+----------------------------------------+
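As a Spark-free sketch of the case-class idea described above (an illustration only, not this answer's original code), the per-row update could look roughly like this. The object `CaseClassUpdate` and the helper `markKeys` are hypothetical names, and the sketch assumes the key list may be null for rows without a match.

```scala
// The fixed-field case class from the answer.
case class result(key1: String, key2: String, key3: String)

object CaseClassUpdate {
  // Hypothetical helper: sets every field whose name appears in `keys` to "1.0".
  // `keys` may be null (for example after a left outer join), so wrap it in Option.
  def markKeys(r: result, keys: Seq[String]): result = {
    val present = Option(keys).getOrElse(Seq.empty).toSet
    def pick(name: String, old: String): String = if (present(name)) "1.0" else old
    result(pick("key1", r.key1), pick("key2", r.key2), pick("key3", r.key3))
  }
}
```

For example, `CaseClassUpdate.markKeys(result("0.0", "0.0", "0.0"), Seq("key2", "key3"))` returns `result("0.0", "1.0", "1.0")`. Because the field names are hard-coded, this shape only works when the set of keys is known in advance.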
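I hope the answer is helpful.
You definitely need a UDF here to do this cleanly.
After joining on date and accountid, you can pass the array and the JSON to a UDF, parse the JSON inside the UDF with a parser of your choice (I use JSON4S in this example), check whether each key exists in the array, change the value if it does, then convert the result back to JSON and return it from the UDF.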
val gp=df.groupBy("accountid","date").agg(collect_list("key").as("key"))
val joined = df2.join(gp, Seq("date", "accountid") , "left_outer")
joined.show(false)
//+--------------+---------+----------------------------------------+------------+
//|date |accountid|result |key |
//+--------------+---------+----------------------------------------+------------+
//|20180610114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2] |
//|20180613114049|id3 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key2, key3]|
//|20180610114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1, key1]|
//|20180611114049|id1 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|null |
//|20180612114049|id2 |{'key1':'0.0','key2':'0.0','key3':'0.0'}|[key1] |
//+--------------+---------+----------------------------------------+------------+
// the UDF that will do the most work
// it's important to declare `formats` inside the function
// to avoid object not Serializable exception
// Not all cases are covered, use with caution :D
val convertJsonValues = udf{(json: String, arr: Seq[String]) =>
import org.json4s.jackson.JsonMethods._
import org.json4s.JsonDSL._
implicit val format = org.json4s.DefaultFormats
// replace single quotes with double
val kvMap = parse(json.replaceAll("'", "\"")).values.asInstanceOf[Map[String,String]]
val updatedKV = kvMap.map{ case(k,v) => if(arr.contains(k)) (k,"1.0") else (k,v) }
compact(render(updatedKV))
}
// Use when-otherwise and send empty array where `key` is null
joined.select($"date",
$"accountid",
when($"key".isNull, convertJsonValues($"result", array()))
.otherwise(convertJsonValues($"result", $"key"))
.as("result")
).show(false)
//+--------------+---------+----------------------------------------+
//|date |accountid|result |
//+--------------+---------+----------------------------------------+
//|20180610114049|id2 |{"key1":"0.0","key2":"1.0","key3":"0.0"}|
//|20180613114049|id3 |{"key1":"0.0","key2":"1.0","key3":"1.0"}|
//|20180610114049|id1 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//|20180611114049|id1 |{"key1":"0.0","key2":"0.0","key3":"0.0"}|
//|20180612114049|id2 |{"key1":"1.0","key2":"0.0","key3":"0.0"}|
//+--------------+---------+----------------------------------------+
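The logic inside the UDF above does not depend on how many keys the JSON holds. A minimal Spark-free sketch of that logic follows, assuming flat objects with single-quoted string keys and values as in the sample data. The regex-based `parseFlat` is a naive stand-in for JSON4S, not a general JSON parser, and `JsonKeyUpdate` and its methods are hypothetical names.

```scala
object JsonKeyUpdate {
  // Matches one 'key':'value' pair in a flat, single-quoted object.
  private val Pair = """'([^']*)'\s*:\s*'([^']*)'""".r

  // Naive stand-in for a real JSON parser: returns the pairs in source order.
  def parseFlat(json: String): Vector[(String, String)] =
    Pair.findAllMatchIn(json).map(m => (m.group(1), m.group(2))).toVector

  // Sets the value to "1.0" for every key listed in `keys`; null means no keys.
  def update(pairs: Vector[(String, String)], keys: Seq[String]): Vector[(String, String)] = {
    val present = Option(keys).getOrElse(Seq.empty).toSet
    pairs.map { case (k, v) => if (present(k)) (k, "1.0") else (k, v) }
  }

  // Renders the pairs back as a compact, double-quoted JSON object.
  def render(pairs: Vector[(String, String)]): String =
    pairs.map { case (k, v) => "\"" + k + "\":\"" + v + "\"" }.mkString("{", ",", "}")

  // Parse, update, and render in one step.
  def convert(json: String, keys: Seq[String]): String =
    render(update(parseFlat(json), keys))
}
```

Calling `JsonKeyUpdate.convert("{'key1':'0.0','key2':'0.0','key3':'0.0'}", Seq("key2", "key3"))` yields `{"key1":"0.0","key2":"1.0","key3":"1.0"}`, and the same call works unchanged if the object has four keys or forty. For real data, a proper parser such as JSON4S is still the safer choice, since this sketch ignores nesting, escapes, and non-string values.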
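I know I didn't mention this in the question, but there may be more keys, so I can't use the solution based on a case class with three constant keys.
You can add or remove keys as needed, @Masterbuilder :)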