Apply a Scala window function when a condition is true, otherwise fill with the last value
Tags: scala, dataframe, apache-spark, apache-spark-sql

I have a set of transactions for various email IDs. For example:

  val df = Seq(
      ("a@gmail.com", "2020-10-01 01:04:00", "txid-0", false),
      ("a@gmail.com", "2020-10-02 01:04:00", "txid-1", true),
      ("a@gmail.com", "2020-10-02 02:04:00", "txid-2", false),
      ("a@gmail.com", "2020-10-02 03:04:00", "txid-3", true),
      ("a@gmail.com", "2020-10-02 04:04:00", "txid-4", false),
      ("a@gmail.com", "2020-10-02 04:05:00", "txid-5", false),
      ("a@gmail.com", "2020-10-02 05:04:00", "txid-6", true),
      ("a@gmail.com", "2020-10-05 12:04:00", "txid-7", true),
      ("b@gmail.com", "2020-12-03 03:04:00", "txid-8", true),
      ("c@gmail.com", "2020-12-04 06:04:00", "txid-9", true)
    ).toDF("email", "timestamp", "transaction_id", "condition")
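What I want is, for each email, the number of transactions in the past 24 hours for which `condition` is true. If `condition` is false, I just want the `count` column to hold the last good count from a row where `condition` was true. For the data above, this is the expected result: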

val expectedDF = Seq(
  ("a@gmail.com", "2020-10-01 01:04:00", "txid-0", false, 0),
  ("a@gmail.com", "2020-10-02 01:04:00", "txid-1", true, 1),
  ("a@gmail.com", "2020-10-02 02:04:00", "txid-2", false, 1),// copy last count since condition is false
  ("a@gmail.com", "2020-10-02 03:04:00", "txid-3", true, 2),
  ("a@gmail.com", "2020-10-02 04:04:00", "txid-4", false, 2),// copy last count since condition is false
  ("a@gmail.com", "2020-10-02 04:05:00", "txid-5", false, 2),// copy last count since condition is false
  ("a@gmail.com", "2020-10-02 05:04:00", "txid-6", true, 3),
  ("a@gmail.com", "2020-10-05 12:04:00", "txid-7", true, 1), // beyond 24 hrs from prev transaction
  ("b@gmail.com", "2020-12-03 03:04:00", "txid-8", true, 1), // new email
  ("c@gmail.com", "2020-12-04 06:04:00", "txid-9", true, 1) // new email
).toDF("email", "timestamp", "transaction_id", "condition", "count")
This is what I have done so far:

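    // Imports needed outside spark-shell; $"..." and toDF also need spark.implicits._
    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.{col, count, unix_timestamp, when}
    import org.apache.spark.sql.types.LongType

    val new_df = df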
      .withColumn("transaction_timestamp", unix_timestamp($"timestamp").cast(LongType))

    val winSpec = Window
      .partitionBy("email")
      .orderBy(col("transaction_timestamp"))
      .rangeBetween(-24*3600, Window.currentRow)

    val resultDF = new_df
      .filter(col("condition"))
      .withColumn("count", count(col("email")).over(winSpec))

    resultDF.show()
This prints the following, which contains no rows with `condition == false`. I want all rows to be present with the correct count value, as in `expectedDF`:

("email",      | "timestamp"         | "transaction_id" | "condition" | "count")
("a@gmail.com", "2020-10-02 01:04:00", "txid-1",           true,            1),
("a@gmail.com", "2020-10-02 03:04:00", "txid-3",           true,            2),
("a@gmail.com", "2020-10-02 05:04:00", "txid-6",           true,            3),
("a@gmail.com", "2020-10-05 12:04:00", "txid-7",           true,            1),
("b@gmail.com", "2020-12-03 03:04:00", "txid-8",           true,            1),
("c@gmail.com", "2020-12-04 06:04:00", "txid-9",           true,            1)

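I could not find a way to apply the window function so that it is computed only when the condition is true and otherwise copies the last good value from a row where the condition was true. Any help would be appreciated.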

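Don't filter; use a conditional expression with `when` instead. Rows where `condition` is false produce null inside `when`, and `count` skips nulls, so only the true rows in the window are counted: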

val resultDF = new_df
  .withColumn("count", count(when(col("condition"), col("email"))).over(winSpec))

resultDF.show()

+-----------+-------------------+--------------+---------+---------------------+-----+
|      email|          timestamp|transaction_id|condition|transaction_timestamp|count|
+-----------+-------------------+--------------+---------+---------------------+-----+
|a@gmail.com|2020-10-01 01:04:00|        txid-0|    false|         1.60151424E9|    0|
|a@gmail.com|2020-10-02 01:04:00|        txid-1|     true|         1.60160064E9|    1|
|a@gmail.com|2020-10-02 02:04:00|        txid-2|    false|         1.60160424E9|    1|
|a@gmail.com|2020-10-02 03:04:00|        txid-3|     true|         1.60160784E9|    2|
|a@gmail.com|2020-10-02 04:04:00|        txid-4|    false|         1.60161144E9|    2|
|a@gmail.com|2020-10-02 04:05:00|        txid-5|    false|          1.6016115E9|    2|
|a@gmail.com|2020-10-02 05:04:00|        txid-6|     true|         1.60161504E9|    3|
|a@gmail.com|2020-10-05 12:04:00|        txid-7|     true|         1.60189944E9|    1|
|c@gmail.com|2020-12-04 06:04:00|        txid-9|     true|         1.60706184E9|    1|
|b@gmail.com|2020-12-03 03:04:00|        txid-8|     true|         1.60696464E9|    1|
+-----------+-------------------+--------------+---------+---------------------+-----+
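
One caveat worth checking against your own data: on a row where `condition` is false, the expression above counts the true rows inside that row's own 24-hour window. That matches the sample, but it is not the same as literally carrying the previous true row's count forward once that row is more than 24 hours old. If a strict carry-forward is what you need, the following is a minimal sketch, not a definitive implementation. The names `ForwardFillCount`, `withFilledCount`, `window_count` and `true_count`, the local SparkSession, and the extra `txid-x` row are assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, col, count, last, lit, unix_timestamp, when}

object ForwardFillCount {
  // Hypothetical helper: expects columns email, timestamp, condition.
  def withFilledCount(df: DataFrame): DataFrame = {
    val withTs = df.withColumn("transaction_timestamp", unix_timestamp(col("timestamp")))

    // Trailing 24-hour range window per email (range is in seconds).
    val last24h = Window.partitionBy("email").orderBy("transaction_timestamp")
      .rangeBetween(-24 * 3600, Window.currentRow)
    // Every earlier row of the same email, up to and including the current one.
    val upToNow = Window.partitionBy("email").orderBy("transaction_timestamp")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    withTs
      // Count of true rows in the trailing 24 hours (same as the answer above).
      .withColumn("window_count", count(when(col("condition"), col("email"))).over(last24h))
      // Keep that count only on true rows; false rows become null.
      .withColumn("true_count", when(col("condition"), col("window_count")))
      // Carry the most recent non-null value forward; 0 if there is none yet.
      .withColumn("count", coalesce(last(col("true_count"), ignoreNulls = true).over(upToNow), lit(0L)))
      .drop("true_count")
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local session for a standalone run; spark-shell already provides `spark`.
    val spark = SparkSession.builder().master("local[1]").appName("forward-fill-count").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a@gmail.com", "2020-10-01 01:04:00", "txid-0", false),
      ("a@gmail.com", "2020-10-02 01:04:00", "txid-1", true),
      ("a@gmail.com", "2020-10-02 02:04:00", "txid-2", false),
      // Hypothetical row, more than 24 hours after the last true row.
      ("a@gmail.com", "2020-10-04 02:04:00", "txid-x", false)
    ).toDF("email", "timestamp", "transaction_id", "condition")

    // On txid-x, window_count is 0 while count carries forward 1.
    withFilledCount(df).orderBy("transaction_timestamp").show(false)
    spark.stop()
  }
}
```

The two columns agree on every row of the original sample; they only diverge when a false row has no true row within its own 24 hours. Rows with identical timestamps inside one email would make the row-based `upToNow` ordering ambiguous, so a tie-breaker column in `orderBy` may be needed in that case.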

Create an extra column with the values 1 and 0 for true and false, and use a cumulative sum. `filter(col("condition"))` can then be removed.

@undefined_variable, thanks for your comment. How would the extra column be used in the cumulative sum calculation? Could you add an answer or expand the comment? The accepted answer works for me, but I have a few additional use cases where I could apply your suggestion/approach.

This does not work when I use `sum` instead of `count`. In that case it prints `null`. Is there a similar solution for `sum`?
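
On the `sum` question: Spark's `sum` returns null when every value in the frame is null, whereas `count` returns 0, which would explain the null on rows whose window contains no true row. One possible fix, and roughly what the first comment suggests, is to give false rows an explicit 0 with `otherwise(0)` so the frame never consists of nulls only. The sketch below assumes that reading; `ConditionalSum`, `withFlagSum` and the `flag` column are names made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, unix_timestamp, when}

object ConditionalSum {
  // Hypothetical helper: expects columns email, timestamp, condition.
  def withFlagSum(df: DataFrame): DataFrame = {
    // Trailing 24-hour range window per email (range is in seconds).
    val last24h = Window.partitionBy("email").orderBy("transaction_timestamp")
      .rangeBetween(-24 * 3600, Window.currentRow)

    df.withColumn("transaction_timestamp", unix_timestamp(col("timestamp")))
      // 1 for true rows, 0 for false rows; never null.
      .withColumn("flag", when(col("condition"), 1).otherwise(0))
      // Summing the flag gives the same number as count(when(...)).
      .withColumn("count", sum(col("flag")).over(last24h))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local session for a standalone run; spark-shell already provides `spark`.
    val spark = SparkSession.builder().master("local[1]").appName("conditional-sum").getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("a@gmail.com", "2020-10-01 01:04:00", "txid-0", false),
      ("a@gmail.com", "2020-10-02 01:04:00", "txid-1", true),
      ("a@gmail.com", "2020-10-02 02:04:00", "txid-2", false),
      ("a@gmail.com", "2020-10-02 03:04:00", "txid-3", true)
    ).toDF("email", "timestamp", "transaction_id", "condition")

    withFlagSum(df).orderBy("transaction_timestamp").show(false)
    spark.stop()
  }
}
```

To total a real numeric column instead of counting, the same shape should work with that column in place of the literal 1, for example `when(col("condition"), col("amount")).otherwise(0)` where `amount` is a hypothetical column. Wrapping the original expression as `coalesce(sum(...).over(winSpec), lit(0))` is an alternative that leaves the inner expression unchanged.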