Apply a Scala window function when a condition is true, otherwise fill with the last value

Given a set of transactions for various email IDs. For example:
val df = Seq(
("a@gmail.com", "2020-10-01 01:04:00", "txid-0", false),
("a@gmail.com", "2020-10-02 01:04:00", "txid-1", true),
("a@gmail.com", "2020-10-02 02:04:00", "txid-2", false),
("a@gmail.com", "2020-10-02 03:04:00", "txid-3", true),
("a@gmail.com", "2020-10-02 04:04:00", "txid-4", false),
("a@gmail.com", "2020-10-02 04:05:00", "txid-5", false),
("a@gmail.com", "2020-10-02 05:04:00", "txid-6", true),
("a@gmail.com", "2020-10-05 12:04:00", "txid-7", true),
("b@gmail.com", "2020-12-03 03:04:00", "txid-8", true),
("c@gmail.com", "2020-12-04 06:04:00", "txid-9", true)
).toDF("email", "timestamp", "transaction_id", "condition")
What I would like is the count of transactions in the past 24 hours, grouped by email, where condition is true. When condition is false, I want the count column to simply carry the last good count from when condition was true. For the data above, this is the expected result:
val expectedDF = Seq(
("a@gmail.com", "2020-10-01 01:04:00", "txid-0", false, 0),
("a@gmail.com", "2020-10-02 01:04:00", "txid-1", true, 1),
("a@gmail.com", "2020-10-02 02:04:00", "txid-2", false, 1),// copy last count since condition is false
("a@gmail.com", "2020-10-02 03:04:00", "txid-3", true, 2),
("a@gmail.com", "2020-10-02 04:04:00", "txid-4", false, 2),// copy last count since condition is false
("a@gmail.com", "2020-10-02 04:05:00", "txid-5", false, 2),// copy last count since condition is false
("a@gmail.com", "2020-10-02 05:04:00", "txid-6", true, 3),
("a@gmail.com", "2020-10-05 12:04:00", "txid-7", true, 1), // beyond 24 hrs from prev transaction
("b@gmail.com", "2020-12-03 03:04:00", "txid-8", true, 1), // new email
("c@gmail.com", "2020-12-04 06:04:00", "txid-9", true, 1) // new email
).toDF("email", "timestamp", "transaction_id", "condition", "count")
What I have done so far:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.LongType

val new_df = df
.withColumn("transaction_timestamp", unix_timestamp($"timestamp").cast(LongType))
val winSpec = Window
.partitionBy("email")
.orderBy(col("transaction_timestamp"))
.rangeBetween(-24*3600, Window.currentRow)
val resultDF = new_df
.filter(col("condition"))
.withColumn("count", count(col("email")).over(winSpec))
resultDF.show()
This prints the following, which is missing all the rows where condition == false, but I want every row to have the correct count value, as in expectedDF:
("email", "timestamp", "transaction_id", "condition", "count")
("a@gmail.com", "2020-10-02 01:04:00", "txid-1", true, 1),
("a@gmail.com", "2020-10-02 03:04:00", "txid-3", true, 2),
("a@gmail.com", "2020-10-02 05:04:00", "txid-6", true, 3),
("a@gmail.com", "2020-10-05 12:04:00", "txid-7", true, 1),
("b@gmail.com", "2020-12-03 03:04:00", "txid-8", true, 1),
("c@gmail.com", "2020-12-04 06:04:00", "txid-9", true, 1)
I am not able to figure out a way to apply the window function so that it computes the count only when the condition is true, and otherwise copies the last good value. Any help would be appreciated.

Don't filter; use a conditional expression inside count instead, so false rows stay in the output but don't contribute to the count:
val resultDF = new_df
.withColumn("count", count(when(col("condition"), col("email"))).over(winSpec))
resultDF.show()
+-----------+-------------------+--------------+---------+---------------------+-----+
| email| timestamp|transaction_id|condition|transaction_timestamp|count|
+-----------+-------------------+--------------+---------+---------------------+-----+
|a@gmail.com|2020-10-01 01:04:00| txid-0| false| 1.60151424E9| 0|
|a@gmail.com|2020-10-02 01:04:00| txid-1| true| 1.60160064E9| 1|
|a@gmail.com|2020-10-02 02:04:00| txid-2| false| 1.60160424E9| 1|
|a@gmail.com|2020-10-02 03:04:00| txid-3| true| 1.60160784E9| 2|
|a@gmail.com|2020-10-02 04:04:00| txid-4| false| 1.60161144E9| 2|
|a@gmail.com|2020-10-02 04:05:00| txid-5| false| 1.6016115E9| 2|
|a@gmail.com|2020-10-02 05:04:00| txid-6| true| 1.60161504E9| 3|
|a@gmail.com|2020-10-05 12:04:00| txid-7| true| 1.60189944E9| 1|
|c@gmail.com|2020-12-04 06:04:00| txid-9| true| 1.60706184E9| 1|
|b@gmail.com|2020-12-03 03:04:00| txid-8| true| 1.60696464E9| 1|
+-----------+-------------------+--------------+---------+---------------------+-----+
Comment: Create an extra column with values 1/0 for true/false and use a cumulative sum; the filter(col("condition")) can then be removed.

Reply: @undefined_variable, thanks for the comment. How do I use the extra column in the cumulative sum calculation? Could you add an answer or a comment? The accepted answer worked for me, but I have a few additional use cases where I could apply your approach.

Comment: This doesn't work when I replace count with sum; in that case it prints null. Is there a similar solution for sum?
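For reference, a minimal sketch of the indicator-column approach suggested in the comment (the 1/0 indicator column and its name are assumptions about what the commenter meant; it reuses the winSpec defined above):

```scala
// Sketch: turn the boolean condition into a 1/0 indicator and take a
// running sum over the same 24-hour range window. Rows with a false
// condition contribute 0, so no filter(col("condition")) is needed and
// they naturally carry the last good count.
val indicatorDF = new_df
  .withColumn("indicator", when(col("condition"), lit(1)).otherwise(lit(0)))
  .withColumn("count", sum(col("indicator")).over(winSpec))
```

This should produce the same count column as the accepted count(when(...)) answer, since summing a 1/0 indicator is equivalent to counting the matching rows.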