Apache spark Spark DataFrame从子查询添加列
使用SQL语法,我可以使用子查询添加新列,如下所示:Apache spark Spark DataFrame从子查询添加列,apache-spark,apache-spark-sql,Apache Spark,Apache Spark Sql,使用SQL语法,我可以使用子查询添加新列,如下所示: import spark.sqlContext.implicits._ List( ("a", "1", "2"), ("b", "1", "3"), ("c", "1", "4"), ("d&q
import spark.sqlContext.implicits._
List(
("a", "1", "2"),
("b", "1", "3"),
("c", "1", "4"),
("d", "1", "5")
).toDF("name", "start", "end")
.createOrReplaceTempView("base")
List(
("a", "1", "2"),
("b", "2", "3"),
("c", "3", "4"),
("d", "4", "5"),
("f", "5", "6")
).toDF("name", "number", "_count")
.createOrReplaceTempView("col")
spark.sql(
"""
|select a.name,
| (select Max(_count) from col b where b.number == a.end) - (select Max(_count) from col b where b.number == a.start) as result
|from base a
|""".stripMargin)
.show(false)
如何使用DataFrame API做到这一点?我找到了语法:
import spark.sqlContext.implicits._
val b = List(
("a", "1", "2"),
("b", "1", "3"),
("c", "1", "4"),
("d", "1", "5")
).toDF("name", "start", "end")
List(
("a", "1", "2"),
("b", "2", "3"),
("c", "3", "4"),
("d", "4", "5"),
("f", "5", "6")
).toDF("name", "number", "_count")
.createOrReplaceTempView("ref_table")
b.withColumn("result", expr("((select max(_count) from ref_table r where r.number = end) - (select max(_count) from ref_table r where r.number = start)) as result")).show(false)
我认为max不是必需的,我们可以遵循以下方法
它使用join。非列
val base = List(
("a", "1", "2"),
("b", "1", "3"),
("c", "1", "4"),
("d", "1", "5")
).toDF("name", "start", "end")
val col = List(
("a", "1", "2"),
("b", "2", "3"),
("c", "3", "4"),
("d", "4", "5"),
("f", "5", "6")
).toDF("name", "number", "_count")
val df = base.join(col, col("number") === base("end")).select(base("name"), col("_count"))
val df1 = base.join(col, col("number") === base("start")).select(base("name").alias("nameDf"), col("_count").alias("count"))
df.join(df1, df("name") === df1("nameDf")).select($"name", ($"_count"- $"count").alias("result")).show(false)