Flatten and (~self-join) a Spark DataFrame with an array of structs in Scala


Input DataFrame:

{
  "F1" : "A",
  "F2" : "B",
  "F3" : [
            {
              "name" : "N1",
              "sf1" : "val_1",
              "sf2" : "val_2"
            },
            {
              "name" : "N2",
              "sf1" : "val_3",
              "sf2" : "val_4"
            }
         ],
  "F4" : {
        "SF1" : "val_5",
        "SF2" : "val_6",
        "SF3" : "val_7"
  }
}
Expected output:

[
  {
    "F1" : "A",
    "F2" : "B",

    "F3_name" : "N1",
    "F3_sf1" : "val_1",
    "F3_sf2" : "val_2",
    
    "F4_SF1" : "val_5",
    "F4_SF2" : "val_6",
    "F4_SF3" : "val_7"
  },
  {
    "F1" : "A",
    "F2" : "B",

    "F3_name" : "N2",
    "F3_sf1" : "val_3",
    "F3_sf2" : "val_4",
    
    "F4_SF1" : "val_5",
    "F4_SF2" : "val_6",
    "F4_SF3" : "val_7"
  }
]
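F3 is an array of structs. The new DataFrame should be flat, turning this single row into one or more rows depending on the number of items in F3 (2 rows in this example).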

I'm new to Spark and Scala. Any ideas on how to achieve this transformation would be very helpful.

Thanks

You can use inline to explode and expand F3, and * to expand F4:

val df2 = df.selectExpr("F1","F2","inline(F3)","F4.*")

df2.show
+---+---+----+-----+-----+-----+-----+-----+
| F1| F2|name|  sf1|  sf2|  SF1|  SF2|  SF3|
+---+---+----+-----+-----+-----+-----+-----+
|  A|  B|  N1|val_1|val_2|val_5|val_6|val_7|
|  A|  B|  N2|val_3|val_4|val_5|val_6|val_7|
+---+---+----+-----+-----+-----+-----+-----+
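The columns above come out as name, sf1, sf2, SF1, ... rather than the F3_ and F4_ prefixed names asked for in the question. One possible way to get the prefixed names is to rename the columns positionally with toDF after the select. The sketch below is only an illustration: the object name, the case classes, and the sampleInput helper are hypothetical and exist just to rebuild the sample row from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object InlineWithPrefixes {
  // Hypothetical case classes that mirror the sample input in the question.
  case class Item(name: String, sf1: String, sf2: String)
  case class F4Struct(SF1: String, SF2: String, SF3: String)
  case class Record(F1: String, F2: String, F3: Seq[Item], F4: F4Struct)

  // Builds the single-row input DataFrame from the question.
  def sampleInput(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      Record(
        "A", "B",
        Seq(Item("N1", "val_1", "val_2"), Item("N2", "val_3", "val_4")),
        F4Struct("val_5", "val_6", "val_7"))
    ).toDF()
  }

  // inline(F3) yields one row per array element with one column per struct field;
  // F4.* expands the struct into columns. toDF then renames all eight columns
  // by position, so the similar names sf1/SF1 never need to be referenced by name.
  def flatten(df: DataFrame): DataFrame =
    df.selectExpr("F1", "F2", "inline(F3)", "F4.*")
      .toDF("F1", "F2", "F3_name", "F3_sf1", "F3_sf2", "F4_SF1", "F4_SF2", "F4_SF3")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("inline-flatten").getOrCreate()
    flatten(sampleInput(spark)).show()
    spark.stop()
  }
}
```

Renaming by position depends on the order of the fields inside F3 and F4, so the list passed to toDF has to be kept in step with the schema.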

You can also use explode first. Then you can extract and rename the fields with a series of aliases (for example, $"F3.name" as "F3_name"):
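The code for this answer did not survive in this copy of the thread, so what follows is a minimal sketch of the explode-then-alias approach rather than the answerer's original snippet. It uses col("...") in place of the $"..." syntax so that no implicits import is needed inside the function; the object name and the sampleInput helper are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeWithAliases {
  // Hypothetical helper: parses the sample document from the question as JSON.
  def sampleInput(spark: SparkSession): DataFrame = {
    import spark.implicits._
    val json =
      """{"F1":"A","F2":"B","F3":[{"name":"N1","sf1":"val_1","sf2":"val_2"},{"name":"N2","sf1":"val_3","sf2":"val_4"}],"F4":{"SF1":"val_5","SF2":"val_6","SF3":"val_7"}}"""
    spark.read.json(Seq(json).toDS())
  }

  // Step 1: explode turns each element of the F3 array into its own row,
  //         leaving F3 as a single struct per row.
  // Step 2: select each nested field and alias it with the parent name as prefix.
  def flatten(df: DataFrame): DataFrame =
    df.withColumn("F3", explode(col("F3")))
      .select(
        col("F1"),
        col("F2"),
        col("F3.name").as("F3_name"),
        col("F3.sf1").as("F3_sf1"),
        col("F3.sf2").as("F3_sf2"),
        col("F4.SF1").as("F4_SF1"),
        col("F4.SF2").as("F4_SF2"),
        col("F4.SF3").as("F4_SF3"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("explode-flatten").getOrCreate()
    flatten(sampleInput(spark)).show()
    spark.stop()
  }
}
```

Note that explode drops rows whose array is null or empty. If such rows should be kept, explode_outer from the same functions object is the usual substitute.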

