Apache Spark: adding a column to a Spark DataFrame from a subquery

With SQL syntax, I can add a new column using subqueries, like this:

import spark.sqlContext.implicits._

     List(
      ("a", "1", "2"),
      ("b", "1", "3"),
      ("c", "1", "4"),
      ("d", "1", "5")
    ).toDF("name", "start", "end")
        .createOrReplaceTempView("base")

    List(
      ("a", "1", "2"),
      ("b", "2", "3"),
      ("c", "3", "4"),
      ("d", "4", "5"),
      ("f", "5", "6")
    ).toDF("name", "number", "_count")
      .createOrReplaceTempView("col")


   spark.sql(
     """
       |select a.name,
       |       (select Max(_count) from col b where b.number == a.end) - (select Max(_count) from col b where b.number == a.start) as result
       |from base a
       |""".stripMargin)
      .show(false)
How can I do the same thing with the DataFrame API?

I found a syntax that works:

import spark.sqlContext.implicits._
import org.apache.spark.sql.functions.expr

     val b = List(
      ("a", "1", "2"),
      ("b", "1", "3"),
      ("c", "1", "4"),
      ("d", "1", "5")
    ).toDF("name", "start", "end")

    List(
      ("a", "1", "2"),
      ("b", "2", "3"),
      ("c", "3", "4"),
      ("d", "4", "5"),
      ("f", "5", "6")
    ).toDF("name", "number", "_count")
      .createOrReplaceTempView("ref_table")


    b.withColumn("result", expr("(select max(_count) from ref_table r where r.number = end) - (select max(_count) from ref_table r where r.number = start)")).show(false)
I don't think max is necessary here. We can take the approach below,
which uses joins rather than a subquery column:
    val base = List(
    ("a", "1", "2"),
    ("b", "1", "3"),
    ("c", "1", "4"),
    ("d", "1", "5")
    ).toDF("name", "start", "end")

    val col = List(
    ("a", "1", "2"),
    ("b", "2", "3"),
    ("c", "3", "4"),
    ("d", "4", "5"),
    ("f", "5", "6")
    ).toDF("name", "number", "_count")

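    // Note: `col` here is the DataFrame defined above, so col("number") is DataFrame.apply, not functions.col.
    // _count looked up at base.end
    val df = base.join(col, col("number") === base("end")).select(base("name"), col("_count"))

    // _count looked up at base.start, renamed so the columns do not clash with df's
    val df1 = base.join(col, col("number") === base("start")).select(base("name").alias("nameDf"), col("_count").alias("count"))

    // Inner joins: a base row without a match on either side is dropped from the result
    df.join(df1, df("name") === df1("nameDf")).select($"name", ($"_count" - $"count").alias("result")).show(false)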
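
One caveat with the join version: it only matches the subquery version when every start and end value exists exactly once in the lookup table. A scalar subquery returns null when nothing matches, while an inner join drops the row, and duplicate number values would multiply rows. The following is a minimal sketch of a variant that keeps the subquery semantics by aggregating the lookup once and using left joins. The object name SubqueryAsJoin, the helper addResult, and the intermediate column names are hypothetical, and the cast of _count to int is an assumption made so that max and the subtraction are numeric (the original code relies on Spark's implicit handling of string columns).

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, max}

object SubqueryAsJoin {

  // Hypothetical helper: result = max(_count) at `end` minus max(_count) at `start`.
  def addResult(base: DataFrame, ref: DataFrame): DataFrame = {
    // Aggregate once so each `number` maps to exactly one value,
    // mirroring max(_count) in the scalar subqueries.
    // Assumption: _count holds integers stored as strings.
    val lookup = ref
      .groupBy("number")
      .agg(max(col("_count").cast("int")).as("max_count"))

    // Two renamed copies of the lookup, one per join, so all column names stay unique.
    val atEnd = lookup.select(col("number").as("end_number"), col("max_count").as("end_count"))
    val atStart = lookup.select(col("number").as("start_number"), col("max_count").as("start_count"))

    base
      .join(atEnd, col("end") === col("end_number"), "left")       // keep unmatched base rows
      .join(atStart, col("start") === col("start_number"), "left")
      .select(col("name"), (col("end_count") - col("start_count")).as("result"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; reuse your own `spark` in practice.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("subquery-as-join")
      .getOrCreate()
    import spark.implicits._

    val base = List(
      ("a", "1", "2"), ("b", "1", "3"), ("c", "1", "4"), ("d", "1", "5")
    ).toDF("name", "start", "end")

    val ref = List(
      ("a", "1", "2"), ("b", "2", "3"), ("c", "3", "4"), ("d", "4", "5"), ("f", "5", "6")
    ).toDF("name", "number", "_count")

    addResult(base, ref).orderBy("name").show(false)
    spark.stop()
  }
}
```

With left joins, a base row whose start or end has no entry in the lookup stays in the output with a null result, which is what the SQL version with scalar subqueries would return for that row.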