Flatten (~self-join) a Spark dataframe with an array of structs in Scala
Input dataframe:
{
  "F1" : "A",
  "F2" : "B",
  "F3" : [
    {
      "name" : "N1",
      "sf1" : "val_1",
      "sf2" : "val_2"
    },
    {
      "name" : "N2",
      "sf1" : "val_3",
      "sf2" : "val_4"
    }
  ],
  "F4" : {
    "SF1" : "val_5",
    "SF2" : "val_6",
    "SF3" : "val_7"
  }
}
Desired output:
[
  {
    "F1" : "A",
    "F2" : "B",
    "F3_name" : "N1",
    "F3_sf1" : "val_1",
    "F3_sf2" : "val_2",
    "F4_SF1" : "val_5",
    "F4_SF2" : "val_6",
    "F4_SF3" : "val_7"
  },
  {
    "F1" : "A",
    "F2" : "B",
    "F3_name" : "N2",
    "F3_sf1" : "val_3",
    "F3_sf2" : "val_4",
    "F4_SF1" : "val_5",
    "F4_SF2" : "val_6",
    "F4_SF3" : "val_7"
  }
]
F3 is an array of structs. The new dataframe should be flat, and this single row should become one or more rows depending on the number of items in F3 (2 rows in this example).
I'm new to Spark & Scala. Any ideas on how to achieve this transformation would be very helpful.
Thanks.

You can use inline to explode and expand F3, and F4.* to expand F4:
val df2 = df.selectExpr("F1","F2","inline(F3)","F4.*")
df2.show
+---+---+----+-----+-----+-----+-----+-----+
| F1| F2|name| sf1| sf2| SF1| SF2| SF3|
+---+---+----+-----+-----+-----+-----+-----+
| A| B| N1|val_1|val_2|val_5|val_6|val_7|
| A| B| N2|val_3|val_4|val_5|val_6|val_7|
+---+---+----+-----+-----+-----+-----+-----+
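Note that inline and F4.* produce bare column names (name, sf1, SF1, ...) rather than the F3_/F4_-prefixed names in the desired output. A minimal sketch of renaming them afterwards, assuming the df2 from above (the mapping is written out by hand here):

```scala
// Rename the flattened columns to carry their parent-field prefix,
// matching the desired output schema.
val renamed = df2
  .withColumnRenamed("name", "F3_name")
  .withColumnRenamed("sf1",  "F3_sf1")
  .withColumnRenamed("sf2",  "F3_sf2")
  .withColumnRenamed("SF1",  "F4_SF1")
  .withColumnRenamed("SF2",  "F4_SF2")
  .withColumnRenamed("SF3",  "F4_SF3")
```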
You can also use explode first. Then you can extract and rename the fields with a sequence of aliases (e.g. $"F3.name".as("F3_name")):
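A sketch of that approach, assuming the same df as above and that spark.implicits._ is in scope for the $ syntax:

```scala
import org.apache.spark.sql.functions.explode

// explode turns each element of the F3 array into its own row,
// then select flattens the structs and renames in one pass.
val df3 = df
  .withColumn("F3", explode($"F3"))
  .select(
    $"F1",
    $"F2",
    $"F3.name".as("F3_name"),
    $"F3.sf1".as("F3_sf1"),
    $"F3.sf2".as("F3_sf2"),
    $"F4.SF1".as("F4_SF1"),
    $"F4.SF2".as("F4_SF2"),
    $"F4.SF3".as("F4_SF3")
  )
```

Unlike inline, explode on an array of structs keeps each element as a single struct column, so the nested fields are pulled out explicitly in the select; the upside is that you control the output column names directly.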