Adding values to existing nested JSON in a Spark DataFrame column
Using Spark 2.3.2, I am trying to take the values of some columns of a DataFrame and put them into an existing JSON structure. Suppose I have this DataFrame:
val testDF = Seq(("""{"foo": "bar", "meta":{"app1":{"p":"2", "o":"100"}, "app2":{"p":"5", "o":"200"}}}""", "10", "1337")).toDF("key", "p", "o")
// used as the key of the nested JSON structure
val app = "appX"
Basically, I want to get from this column
{
"foo": "bar",
"meta": {
"app1": {
"p": "2",
"o": "100"
},
"app2": {
"p": "5",
"o": "200"
}
}
}
to this:
{
"meta": {
"app1": {
"p": "2",
"o": "100"
},
"app2": {
"p": "5",
"o": "200"
},
"appX": {
"p": "10",
"o": "1337"
}
}
}
based on the p and o columns of the DataFrame.
I tried:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
def process(inputDF: DataFrame, appName: String): DataFrame = {
val res = inputDF
.withColumn(appName, to_json(expr("(p, o)")))
.withColumn("meta", struct(get_json_object('key, "$.meta")))
.selectExpr(s"""struct(meta.*, ${appName} as ${appName}) as myStruct""")
.select(to_json('myStruct).as("newMeta"))
res.show(false)
res
}
val resultDF = process(testDF, app)
val resultString = resultDF.select("newMeta").collectAsList().get(0).getString(0)
treatEscapes(resultString) must be ("""{"meta":{"app1":{"p":"2","o":"100"},"app2":{"p":"5","o":"200"},"appX":{"p":"10","o":"1337"}}}""")
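But this assertion does not match, because I
- cannot put the content of appX at the same level as the other two apps,
- don't know how to handle the quotes correctly, and
- don't know how to rename "col1" to "meta".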
Expected :"{"[meta":{"app1":{"p":"2","o":"100"},"app2":{"p":"5","o":"200"},"appX":{"p":"10","o":"1337"}}]}"
Actual :"{"[col1":"{"app1":{"p":"2","o":"100"},"app2":{"p":"5","o":"200"}}","appX":"{"p":"10","o":"1337"}"]}"
Parse the meta content and convert the p and o columns to a map datatype: map(lit(app), struct($"p", $"o")).
Then use the map_concat function to concatenate the data.
scala> testDF.show(false)
+---------------------------------------------------------------------------------+---+----+
|key |p |o |
+---------------------------------------------------------------------------------+---+----+
|{"foo": "bar", "meta":{"app1":{"p":"2", "o":"100"}, "app2":{"p":"5", "o":"200"}}}|10 |1337|
+---------------------------------------------------------------------------------+---+----+
Create a schema to convert the string to JSON
scala> val schema = new StructType().add("foo",StringType).add("meta",MapType(StringType,new StructType().add("p",StringType).add("o",StringType)))
Print the schema
scala> schema.printTreeString
root
|-- foo: string (nullable = true)
|-- meta: map (nullable = true)
| |-- key: string
| |-- value: struct (valueContainsNull = true)
| | |-- p: string (nullable = true)
| | |-- o: string (nullable = true)
Final output
+-----------------------------------------------------------------------------------------------------------------+
|json_data |
+-----------------------------------------------------------------------------------------------------------------+
|{"key":{"foo":"bar","meta":{"app1":{"p":"2","o":"100"},"app2":{"p":"5","o":"200"},"appX":{"p":"10","o":"1337"}}}}|
+-----------------------------------------------------------------------------------------------------------------+
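The built-in map_concat function requires Spark version >= 2.4.0.
For earlier versions, use a UDF with the help of a case class.
Define a case class to hold the p and o column values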
scala> case class PO(p:String,o:String)
Define a UDF to concat the maps
scala> val map_concat = udf((mp:Map[String,PO],mpa:Map[String,PO]) => mp ++ mpa)
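The body of this UDF is an ordinary Scala map merge, so its behaviour can be checked without Spark. The sketch below assumes made-up values matching the example row; the names meta, extra, merged and overwritten are illustrative only and do not appear in the answer.

```scala
// Stand-in for the struct built from the p and o columns.
case class PO(p: String, o: String)

// Entries as they would be parsed from the existing "meta" object.
val meta = Map("app1" -> PO("2", "100"), "app2" -> PO("5", "200"))

// New entry built from the row's p and o values, keyed by the app name.
val extra = Map("appX" -> PO("10", "1337"))

// ++ keeps every key from both maps.
val merged = meta ++ extra

// On a duplicate key, the value from the right-hand map wins.
val overwritten = meta ++ Map("app1" -> PO("9", "9"))
```

Note the second case: if the app name already exists under meta, its entry is replaced rather than duplicated.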
Final output
+-------------------------------------------+---+----+---------------------------------------------------------------------------------------------------------+
|key |p |o |newMap |
+-------------------------------------------+---+----+---------------------------------------------------------------------------------------------------------+
|[bar,Map(app1 -> [2,100], app2 -> [5,200])]|10 |1337|{"foo":"bar","meta":{"app1":{"p":"2","o":"100"},"app2":{"p":"5","o":"200"},"appX":{"p":"10","o":"1337"}}}|
+-------------------------------------------+---+----+---------------------------------------------------------------------------------------------------------+
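The statement that applies the UDF is not shown with this output. A minimal sketch of how it might look follows, assuming the schema and test data defined earlier; the UDF is named mapConcatUdf here (a hypothetical name) so that it does not shadow the built-in map_concat on Spark 2.4 and later, and the local SparkSession settings are assumptions for running it outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, lit, map, struct, to_json, udf}
import org.apache.spark.sql.types.{MapType, StringType, StructType}

// Local session for the sketch; inside spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("udf-map-concat").getOrCreate()
import spark.implicits._

case class PO(p: String, o: String)

val app = "appX"
val testDF = Seq(
  ("""{"foo": "bar", "meta":{"app1":{"p":"2", "o":"100"}, "app2":{"p":"5", "o":"200"}}}""", "10", "1337")
).toDF("key", "p", "o")

// Same schema as in the answer: meta is a map from app name to a (p, o) struct.
val schema = new StructType()
  .add("foo", StringType)
  .add("meta", MapType(StringType, new StructType().add("p", StringType).add("o", StringType)))

// Hypothetical name for the UDF that merges two maps.
val mapConcatUdf = udf((mp: Map[String, PO], mpa: Map[String, PO]) => mp ++ mpa)

val result = testDF
  // Parse the JSON string into a struct with a map-typed meta field.
  .withColumn("key", from_json($"key", schema))
  // Merge the new entry into meta and serialise the whole struct back to JSON.
  .withColumn("newMap", to_json(struct(
    $"key.foo",
    mapConcatUdf($"key.meta", map(lit(app), struct($"p", $"o"))).as("meta")
  )))

result.show(false)
```

Keeping the parsed struct in key and writing the JSON to a separate newMap column mirrors the column layout of the table above.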
It is 2.3.2; I have added that to the question.
This line
.withColumn("meta", struct(get_json_object('key, "$.meta")))
is wrong: it does not flatten the meta column value.
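The reason that line does not flatten anything is that get_json_object returns the matched JSON fragment as a plain string, so struct(...) merely wraps one string column, whereas from_json with a schema produces a typed map. The sketch below compares the two resulting column types; the shortened sample JSON and the names asString and asMap are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{from_json, get_json_object}
import org.apache.spark.sql.types.{MapType, StringType, StructType}

// Local session for the sketch; inside spark-shell this returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("meta-type-check").getOrCreate()
import spark.implicits._

val df = Seq("""{"foo":"bar","meta":{"app1":{"p":"2","o":"100"}}}""").toDF("key")

val metaSchema = MapType(StringType, new StructType().add("p", StringType).add("o", StringType))
val schema = new StructType().add("foo", StringType).add("meta", metaSchema)

// get_json_object extracts the fragment, but only as a string.
val asString = df.select(get_json_object($"key", "$.meta").as("meta"))

// from_json parses it, so meta becomes a map of structs that can be merged.
val asMap = df.select(from_json($"key", schema).getField("meta").as("meta"))

asString.printSchema()
asMap.printSchema()
```

Only the map-typed column can be combined with a new entry and serialised back as nested JSON; the string version ends up re-quoted by to_json, as in the Actual output of the question.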
scala> testDF
.withColumn("key",from_json($"key",schema))
.withColumn(
"key",
to_json(
struct(
$"key.foo",
map_concat(
$"key.meta",
map(
lit(app),
struct($"p",$"o")
)
).as("meta")
)
)
)
.show(false)