Apache Spark: generating 2 rows for each row using the optimized DSL


I have data like the following:

id,ts_start,ts_end,foo_start,foo_end
1,1,2,f_s,f_e
2,3,4,foo,bar
3,3,6,foo,f_e

That is, each record is a single row holding both the start and the end information. Using flatMap, these can be converted to

id,ts,foo
1,1,f_s
1,2,f_e
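For reference, the flatMap conversion described above could look roughly like the sketch below. It uses plain Scala collections in place of a Spark Dataset so that it runs without a cluster, and the names FlatMapSketch, Wide and Narrow are made up for illustration (they are not from my actual code). A typed Dataset would use the same lambda, which is exactly the custom-code path I would like to avoid.

```scala
object FlatMapSketch {
  // Hypothetical record types mirroring the input and the desired output columns.
  final case class Wide(id: Int, tsStart: Int, tsEnd: Int, fooStart: String, fooEnd: String)
  final case class Narrow(id: Int, ts: Int, foo: String)

  // Same sample data as above, held in a local Seq instead of a Dataset (assumption).
  val wide: Seq[Wide] = Seq(
    Wide(1, 1, 2, "f_s", "f_e"),
    Wide(2, 3, 4, "foo", "bar"),
    Wide(3, 3, 6, "foo", "f_e")
  )

  // Each input record yields exactly two output records:
  // one built from the *_start columns and one from the *_end columns.
  val narrow: Seq[Narrow] = wide.flatMap { w =>
    Seq(
      Narrow(w.id, w.tsStart, w.fooStart),
      Narrow(w.id, w.tsEnd, w.fooEnd)
    )
  }

  def main(args: Array[String]): Unit =
    narrow.foreach(n => println(s"${n.id},${n.ts},${n.foo}"))
}
```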

How can I achieve the same thing using the optimized SQL DSL with explode or pivot?

Edit: obviously, I don't want to read the data in twice and union the results.

Or is that the only option if I don't want to use flatMap + serde + custom code?

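Given:

// Assumes a SparkSession named `spark` is in scope (as in spark-shell).
import spark.implicits._ // provides toDF and the $"col" syntax

val df = Seq(
  (1, 1, 2, "f_s", "f_e"),
  (2, 3, 4, "foo", "bar"),
  (3, 3, 6, "foo", "f_e")
).toDF("id", "ts_start", "ts_end", "foo_start", "foo_end")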

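You can do:

import org.apache.spark.sql.functions.{array, explode, struct}

df
  .select(
    $"id",
    // Build a two-element array of (ts, foo) structs, then explode it into two rows.
    explode(
      array(
        struct($"ts_start".as("ts"), $"foo_start".as("foo")),
        struct($"ts_end".as("ts"), $"foo_end".as("foo"))
      )
    ).as("tmp")
  )
  // Flatten the struct back into top-level columns.
  .select(
    $"id",
    $"tmp.*"
  )
  .show()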

Which gives:

+---+---+---+
| id| ts|foo|
+---+---+---+
|  1|  1|f_s|
|  1|  2|f_e|
|  2|  3|foo|
|  2|  4|bar|
|  3|  3|foo|
|  3|  6|f_e|
+---+---+---+
