Spark Scala: window lead function and case statement do not give the expected result


Here is the DataFrame:

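import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// Assumes a SparkSession named `spark` is in scope (as in spark-shell); needed for toDF
import spark.implicits._

val df = Seq(
  ("Alice", 1, "2016-05-01"),
  ("Alice", 1, "2016-05-03"),
  ("Alice", 2, "2016-05-04"),
  ("Bob", 3, "2016-05-01")
).toDF("name", "value", "date")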
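If the value in the next row (the lead row, partitioned by name) is the same as df("value"), I want the result to be "NOCHANGE"; otherwise I want to subtract 1 day from the date column: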

val windowSpec = Window.partitionBy("name").orderBy("date")
df.withColumn("result", 
  when(lead(df("value"), 1).over(windowSpec) === df("value") , "NOCHANGE" )
  .otherwise(date_sub(df("date"),1))
).show()
The output of this statement is:

+-----+-----+----------+----------+
| name|value|      date|    result|
+-----+-----+----------+----------+
|Alice|    1|2016-05-01|  NOCHANGE|
|Alice|    1|2016-05-03|2016-05-02|
|Alice|    2|2016-05-04|2016-05-03|
|  Bob|    3|2016-05-01|2016-04-30|
+-----+-----+----------+----------+
But the expected output is:

+-----+-----+----------+----------+
| name|value|      date|    result|
+-----+-----+----------+----------+
|Alice|    1|2016-05-01|  NOCHANGE|
|Alice|    1|2016-05-03|2016-05-02|
|Alice|    2|2016-05-04|  NOCHANGE| //as it is last value of Alice partition
|  Bob|    3|2016-05-01|  NOCHANGE|//as no leading value in Bob partition
+-----+-----+----------+----------+
Am I doing something wrong?

Also, if I need to compare multiple columns (value1, value2, value3), what is the best way to compare consecutive rows?

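This is because lead(..., 1) returns null on the last row of each partition, and you are not handling that case. Comparing null with === yields null rather than true, so when(...) falls through to otherwise(...). See: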

df.withColumn("result" , lead(col("value"), 1).over(windowSpec)).show
+-----+-----+----------+------+
| name|value|      date|result|
+-----+-----+----------+------+
|Alice|    1|2016-05-01|     1|
|Alice|    1|2016-05-03|     2|
|Alice|    2|2016-05-04|  null|
|  Bob|    3|2016-05-01|  null|
+-----+-----+----------+------+
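To make the fall-through visible, here is a minimal sketch that materializes the question's condition as its own column. It assumes a throwaway local SparkSession, and the column name cond is only there for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead}

// Assumption: a local session for the demo; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("lead-null-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Alice", 1, "2016-05-01"),
  ("Alice", 1, "2016-05-03"),
  ("Alice", 2, "2016-05-04"),
  ("Bob", 3, "2016-05-01")
).toDF("name", "value", "date")

val windowSpec = Window.partitionBy("name").orderBy("date")

// `cond` is the exact condition from the question, kept as a separate column.
// It is null (not false) wherever lead() has no next row in the partition.
val withCond = df.withColumn("cond", lead(col("value"), 1).over(windowSpec) === col("value"))
withCond.orderBy("name", "date").show()
```

On the sample data, cond should come out as true, false, null, null, and when(...) only takes its branch for true.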
Try the following:

df.withColumn("result" , lead(col("value"), 1).over(windowSpec))
  .withColumn("result",
    when(col("result") === col("value") || col("result").isNull, "NOCHANGE")
    .otherwise(date_sub(col("date"), 1))
  ).show

+-----+-----+----------+----------+
| name|value|      date|    result|
+-----+-----+----------+----------+
|Alice|    1|2016-05-01|  NOCHANGE|
|Alice|    1|2016-05-03|2016-05-02|
|Alice|    2|2016-05-04|  NOCHANGE|
|  Bob|    3|2016-05-01|  NOCHANGE|
+-----+-----+----------+----------+
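A variation on the same idea, if you prefer a single expression: wrap the lead in coalesce so that a missing next row falls back to the current row's own value, which makes the comparison true on the last row of each partition. This is only a sketch under a couple of assumptions: a local SparkSession, and a value column that never contains real nulls (a genuinely null next value would also be reported as NOCHANGE). The explicit to_date and cast("string") are my additions to keep both branches of when/otherwise the same type.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, date_sub, lead, to_date, when}

// Assumption: a local session for the demo; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("lead-coalesce-demo").getOrCreate()
import spark.implicits._

val df = Seq(
  ("Alice", 1, "2016-05-01"),
  ("Alice", 1, "2016-05-03"),
  ("Alice", 2, "2016-05-04"),
  ("Bob", 3, "2016-05-01")
).toDF("name", "value", "date")

val windowSpec = Window.partitionBy("name").orderBy("date")

// Next row's value, or the current value when there is no next row.
val nextOrSelf = coalesce(lead(col("value"), 1).over(windowSpec), col("value"))

val result = df.withColumn(
  "result",
  when(nextOrSelf === col("value"), "NOCHANGE")
    // Parse the string date, subtract a day, and go back to a string
    // so both branches are strings.
    .otherwise(date_sub(to_date(col("date")), 1).cast("string"))
)
result.orderBy("name", "date").show()
```

Whether this reads better than the explicit isNull check is a matter of taste; the isNull version states the intent more directly.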
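If you want to compare multiple columns, you need one intermediate lead column per compared column, and then combine the comparisons with && to build the final result. Perhaps something like this: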

val df2 = ....toDF("name", "value1", "value2", "date")

// Note: operate on df2 (the DataFrame that has value1/value2), not the original df
df2.withColumn("nextValue1", lead(col("value1"), 1).over(windowSpec))
  .withColumn("nextValue2", lead(col("value2"), 1).over(windowSpec))
  .withColumn("result",
    when(
      // all compared columns unchanged, or no next row in the partition
      (col("nextValue1") === col("value1") && col("nextValue2") === col("value2")) || col("nextValue1").isNull,
      "NOCHANGE"
    ).otherwise(date_sub(col("date"), 1))
  ).drop("nextValue1").drop("nextValue2")
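If the number of columns grows, one way to avoid writing each comparison by hand is to build the condition from a list of column names. The sketch below uses made-up sample rows (the real df2 is elided above), a local SparkSession, and a hypothetical compareCols list. Like the snippet above, it treats a null lead on the first compared column as "no next row", so it assumes that column has no real nulls.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, date_sub, lead, to_date, when}

// Assumption: a local session for the demo; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("lead-multi-col-demo").getOrCreate()
import spark.implicits._

// Made-up sample data, for illustration only.
val df2 = Seq(
  ("Alice", 1, 10, "2016-05-01"),
  ("Alice", 1, 10, "2016-05-03"),
  ("Alice", 1, 20, "2016-05-04"),
  ("Bob", 3, 30, "2016-05-01")
).toDF("name", "value1", "value2", "date")

val windowSpec = Window.partitionBy("name").orderBy("date")
val compareCols = Seq("value1", "value2") // hypothetical list of columns to compare

// One equality test per column against the next row, AND-ed together.
val allSame = compareCols
  .map(c => lead(col(c), 1).over(windowSpec) === col(c))
  .reduce(_ && _)

// Same "no next row" test as above, applied to the first compared column.
val noNextRow = lead(col(compareCols.head), 1).over(windowSpec).isNull

val result = df2.withColumn(
  "result",
  when(allSame || noNextRow, "NOCHANGE")
    .otherwise(date_sub(to_date(col("date")), 1).cast("string"))
)
result.orderBy("name", "date").show()
```

This also avoids the temporary nextValue columns, since the lead expressions are used inline rather than stored and dropped.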

Thanks @David Griffin. If I have multiple columns to compare, are there any pointers on an efficient way to compare rows?
Just added this a few minutes ago.