Spark Scala: window lead function with a when/otherwise (case) expression does not give the expected result
Below is the DataFrame:
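import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
// toDF needs spark.implicits._ (already in scope in spark-shell)

val df = Seq(
  ("Alice", 1, "2016-05-01"),
  ("Alice", 1, "2016-05-03"),
  ("Alice", 2, "2016-05-04"),
  ("Bob", 3, "2016-05-01")
).toDF("name", "value", "date")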
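If the next (lead) row within the same name partition has the same df("value"), I want the result to be "NOCHANGE"; otherwise, subtract 1 day from the date column.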
val windowSpec = Window.partitionBy("name").orderBy("date")
df.withColumn("result",
  when(lead(df("value"), 1).over(windowSpec) === df("value"), "NOCHANGE")
    .otherwise(date_sub(df("date"), 1))
).show()
The output of this statement is:
+-----+-----+----------+----------+
| name|value|      date|    result|
+-----+-----+----------+----------+
|Alice|    1|2016-05-01|  NOCHANGE|
|Alice|    1|2016-05-03|2016-05-02|
|Alice|    2|2016-05-04|2016-05-03|
|  Bob|    3|2016-05-01|2016-04-30|
+-----+-----+----------+----------+
But the expected output is:
+-----+-----+----------+----------+
| name|value|      date|    result|
+-----+-----+----------+----------+
|Alice|    1|2016-05-01|  NOCHANGE|
|Alice|    1|2016-05-03|2016-05-02|
|Alice|    2|2016-05-04|  NOCHANGE| // as it is the last value in the Alice partition
|  Bob|    3|2016-05-01|  NOCHANGE| // as there is no leading value in the Bob partition
+-----+-----+----------+----------+
Am I doing something wrong?
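Also, if there are multiple columns to compare (value1, value2, value3), what is the best way to compare consecutive rows?

This is because lead(..., 1) returns null on the last row of each partition, and you are not handling those nulls: null === value evaluates to null rather than true, so when falls through to otherwise. See this: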
df.withColumn("result" , lead(col("value"), 1).over(windowSpec)).show
+-----+-----+----------+------+
| name|value|      date|result|
+-----+-----+----------+------+
|Alice|    1|2016-05-01|     1|
|Alice|    1|2016-05-03|     2|
|Alice|    2|2016-05-04|  null|
|  Bob|    3|2016-05-01|  null|
+-----+-----+----------+------+
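To make the null handling concrete, here is a minimal plain-Scala model of what happens, with no Spark dependency. It is only an illustration of the semantics, not Spark's implementation: `LeadNullModel`, `withLead`, `sqlEquals`, `original` and `fixed` are hypothetical names, and `Option` stands in for a nullable column value.

```scala
import java.time.LocalDate

// Plain-Scala model of the behaviour described above; all names are illustrative.
object LeadNullModel {
  final case class Row(name: String, value: Int, date: String)

  val sample: Seq[Row] = Seq(
    Row("Alice", 1, "2016-05-01"),
    Row("Alice", 1, "2016-05-03"),
    Row("Alice", 2, "2016-05-04"),
    Row("Bob", 3, "2016-05-01")
  )

  // Mimics lead(value, 1) over (partition by name, order by date):
  // each row is paired with the next row's value, or None at the end of a partition.
  def withLead(rows: Seq[Row]): Seq[(Row, Option[Int])] =
    rows.groupBy(_.name).toSeq.sortBy(_._1).flatMap { case (_, group) =>
      val sorted = group.sortBy(_.date)
      val next: Seq[Option[Int]] = sorted.drop(1).map(r => Some(r.value)) :+ None
      sorted.zip(next)
    }

  // SQL-style equality: comparing with a missing value gives "unknown" (None), not false.
  def sqlEquals(next: Option[Int], current: Int): Option[Boolean] =
    next.map(_ == current)

  private def dayBefore(date: String): String =
    LocalDate.parse(date).minusDays(1).toString

  // Models the question's expression: only a definite true selects "NOCHANGE";
  // false and unknown both fall through to the otherwise branch.
  def original(rows: Seq[Row]): Seq[String] =
    withLead(rows).map { case (r, next) =>
      if (sqlEquals(next, r.value).contains(true)) "NOCHANGE" else dayBefore(r.date)
    }

  // Models the fix: a missing next row is explicitly treated as "no change".
  def fixed(rows: Seq[Row]): Seq[String] =
    withLead(rows).map { case (r, next) =>
      if (next.forall(_ == r.value)) "NOCHANGE" else dayBefore(r.date)
    }

  def main(args: Array[String]): Unit = {
    println(original(sample)) // List(NOCHANGE, 2016-05-02, 2016-05-03, 2016-04-30)
    println(fixed(sample))    // List(NOCHANGE, 2016-05-02, NOCHANGE, NOCHANGE)
  }
}
```

The first list reproduces the unexpected output from the question, and the second matches the expected output.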
Try the following:
df.withColumn("result", lead(col("value"), 1).over(windowSpec))
  .withColumn("result",
    when(col("result") === col("value") || col("result").isNull, "NOCHANGE")
      .otherwise(date_sub(col("date"), 1))
  ).show
+-----+-----+----------+----------+
| name|value|      date|    result|
+-----+-----+----------+----------+
|Alice|    1|2016-05-01|  NOCHANGE|
|Alice|    1|2016-05-03|2016-05-02|
|Alice|    2|2016-05-04|  NOCHANGE|
|  Bob|    3|2016-05-01|  NOCHANGE|
+-----+-----+----------+----------+
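If you want to compare multiple columns, you need one intermediate lead column per compared column, and then combine the comparisons with && to create the final result. Maybe something like this: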
val df2 = ....toDF("name", "value1", "value2", "date")

// Look up the next row's value for each compared column
df2.withColumn("nextValue1", lead(col("value1"), 1).over(windowSpec))
  .withColumn("nextValue2", lead(col("value2"), 1).over(windowSpec))
  .withColumn("result",
    when(
      // NOCHANGE if both columns match the next row, or if there is no next row
      (col("nextValue1") === col("value1") && col("nextValue2") === col("value2")) || col("nextValue1").isNull,
      "NOCHANGE"
    ).otherwise(date_sub(col("date"), 1))
  ).drop("nextValue1").drop("nextValue2")
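For more than a couple of columns, the per-column comparisons can be generated from a list instead of being written out by hand. The following is only a sketch under some assumptions: `MultiColumnLeadSketch`, `withResult` and the sample rows are hypothetical, the compared columns are assumed to contain no nulls of their own, and the list of columns is assumed to be non-empty. It swaps the explicit isNull check for coalesce, and casts the date branch to a string so both branches of when have the same type.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, date_sub, lead, lit, to_date, when}

object MultiColumnLeadSketch {
  // Hypothetical helper: marks rows whose compared columns all equal the next row's values.
  def withResult(df: DataFrame, compareCols: Seq[String]): DataFrame = {
    val windowSpec = Window.partitionBy("name").orderBy("date")

    // One "next row equals current row" test per column, AND-ed together.
    // compareCols is assumed to be non-empty (reduce fails on an empty list).
    val allSame: Column = compareCols
      .map(c => lead(col(c), 1).over(windowSpec) === col(c))
      .reduce(_ && _)

    // On the last row of a partition every lead is null, so allSame is null;
    // coalesce turns that null into true. Assumes the compared columns hold no nulls.
    df.withColumn("result",
      when(coalesce(allSame, lit(true)), "NOCHANGE")
        .otherwise(date_sub(to_date(col("date")), 1).cast("string")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("multi-column-lead-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with two value columns.
    val df2 = Seq(
      ("Alice", 1, 10, "2016-05-01"),
      ("Alice", 1, 10, "2016-05-03"),
      ("Alice", 2, 10, "2016-05-04"),
      ("Bob", 3, 30, "2016-05-01")
    ).toDF("name", "value1", "value2", "date")

    withResult(df2, Seq("value1", "value2")).show()
    spark.stop()
  }
}
```

Adding a third column then only means extending the list, for example Seq("value1", "value2", "value3"). Because all the lead calls share one window specification, Spark can typically evaluate them together rather than re-sorting per column.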
Thanks @David Griffin. If I have multiple columns to compare, are there any pointers on an efficient way to compare rows?
Just added this a few minutes ago.