Dynamic/variable offset in Spark SQL lead/lag functions

Can we use an offset value that depends on a column value in the lead/lag functions in Spark SQL?

Example: the following works fine:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val sampleData = Seq( ("bob","Developer",125000),
  ("mark","Developer",108000),
  ("carl","Tester",70000),
  ("peter","Developer",185000),
  ("jon","Tester",65000),
  ("roman","Tester",82000),
  ("simon","Developer",98000),
  ("eric","Developer",144000),
  ("carlos","Tester",75000),
  ("henry","Developer",110000)).toDF("Name","Role","Salary")

//Window over the whole dataset, ordered by Role (no partitioning)
val window = Window.orderBy("Role")

//Derive lag column for salary
val laggingCol = lag(col("Salary"), 1).over(window)

//Use derived column LastSalary to find difference between current and previous row
val salaryDifference = col("Salary") - col("LastSalary")

//Calculate trend based on the difference
//IF ELSE / CASE can be written using when.otherwise in Spark
val trend = when(col("SalaryDiff").isNull || col("SalaryDiff") === 0, "SAME")
  .when(col("SalaryDiff") > 0, "UP")
  .otherwise("DOWN")

sampleData.withColumn("LastSalary", laggingCol)
  .withColumn("SalaryDiff", salaryDifference)
  .withColumn("Trend", trend).show()
Now, my use case is that the offset we have to pass depends on a particular column of Integer type. This is what I want to get working:

val sampleData = Seq( ("bob","Developer",125000,2),
  ("mark","Developer",108000,3),
  ("carl","Tester",70000,3),
  ("peter","Developer",185000,2),
  ("jon","Tester",65000,1),
  ("roman","Tester",82000,1),
  ("simon","Developer",98000,2),
  ("eric","Developer",144000,3),
  ("carlos","Tester",75000,2),
  ("henry","Developer",110000,2)).toDF("Name","Role","Salary","ColumnForOffset")

val window = Window.orderBy("Role")

//Derive lag column for salary, using the per-row offset column
//Note: this fails, since lag's offset parameter must be an integer literal, not a Column
val laggingCol = lag(col("Salary"), col("ColumnForOffset")).over(window)

//Use derived column LastSalary to find difference between current and previous row
val salaryDifference = col("Salary") - col("LastSalary")

//Calculate trend based on the difference
//IF ELSE / CASE can be written using when.otherwise in Spark
val trend = when(col("SalaryDiff").isNull || col("SalaryDiff") === 0, "SAME")
  .when(col("SalaryDiff") > 0, "UP")
  .otherwise("DOWN")

sampleData.withColumn("LastSalary", laggingCol)
  .withColumn("SalaryDiff", salaryDifference)
  .withColumn("Trend", trend).show()
   
This throws an exception as expected, since the offset only accepts an integer value.
Let's discuss whether we can implement this logic in some way.

You can add a row number column and do a self-join based on the row number and the offset, e.g.:

val df = sampleData.withColumn("rn", row_number().over(window))

val df2 = df.alias("t1").join(
    df.alias("t2"),
    expr("t1.rn = t2.rn + t1.ColumnForOffset"),
    "left"
).selectExpr("t1.*", "t2.Salary as LastSalary")

df2.show
+------+---------+------+---------------+---+----------+
|  Name|     Role|Salary|ColumnForOffset| rn|LastSalary|
+------+---------+------+---------------+---+----------+
|   bob|Developer|125000|              2|  1|      null|
|  mark|Developer|108000|              3|  2|      null|
| peter|Developer|185000|              2|  3|    125000|
| simon|Developer| 98000|              2|  4|    108000|
|  eric|Developer|144000|              3|  5|    108000|
| henry|Developer|110000|              2|  6|     98000|
|  carl|   Tester| 70000|              3|  7|     98000|
|   jon|   Tester| 65000|              1|  8|     70000|
| roman|   Tester| 82000|              1|  9|     65000|
|carlos|   Tester| 75000|              2| 10|     65000|
+------+---------+------+---------------+---+----------+
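
From here, the SalaryDiff and Trend columns can be derived from df2 exactly as in the question, reusing the same expressions:

//Reuse the question's diff/trend logic on the joined result
df2.withColumn("SalaryDiff", col("Salary") - col("LastSalary"))
  .withColumn("Trend",
    when(col("SalaryDiff").isNull || col("SalaryDiff") === 0, "SAME")
      .when(col("SalaryDiff") > 0, "UP")
      .otherwise("DOWN"))
  .show()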

Hi, what if we also have a column that we need to group by? Then you can add that column to your window definition with .partitionBy.
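
For example, a minimal sketch assuming Role is the column to group by (Name is used here only as a placeholder ordering within each group). Since row numbers restart in each partition, the join condition must then also match on the grouping column:

//Partitioned window: row numbers restart for each Role
val window = Window.partitionBy("Role").orderBy("Name")

val df = sampleData.withColumn("rn", row_number().over(window))

//Self-join within each Role group, offset by ColumnForOffset
val df2 = df.alias("t1").join(
    df.alias("t2"),
    expr("t1.Role = t2.Role and t1.rn = t2.rn + t1.ColumnForOffset"),
    "left"
).selectExpr("t1.*", "t2.Salary as LastSalary")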