Apache Spark: group by key and efficiently find the previous timestamp of an event that occurred within a specific time window, using Spark/Scala


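Note: for the aggregation, each group can contain up to 5-10K rows, so efficient code is highly desirable.

My data

val df1 = sc.parallelize(Seq(
  ("user2", "iphone", "2017-12-23 16:58:08", "Success"),
  ("user2", "iphone", "2017-12-23 16:58:12", "Success"),
  ("user2", "iphone", "2017-12-23 16:58:20", "Success"),
  ("user2", "iphone", "2017-12-23 16:58:25", "Success"),
  ("user2", "iphone", "2017-12-23 16:58:35", "Success"),
  ("user2", "iphone", "2017-12-23 16:58:45", "Success")
)).toDF("username", "device", "attempt_at", "stat")

What I want
A grouping by (username, device) for the latest time an event occurred.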

+--------+------+-------------------+-------+-------------------+
|username|device|         attempt_at|   stat|previous_attempt_at|
+--------+------+-------------------+-------+-------------------+
|   user2|iphone|2017-12-23 16:58:45|Success|2017-12-23 16:58:35|
+--------+------+-------------------+-------+-------------------+

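Exception in the desired output:
As mentioned, the previous attempt has to fall within a specific time window. In the input dataset below, the last row has the latest timestamp, December 23. If I look back over a window of 1 day for the previous attempt, the 'previous_attempt_at' column will be null, because no event occurred on the previous day, December 22. The result depends entirely on the range of the input timestamps.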

//Initial Data
+--------+------+-------------------+-------+
|username|device|         attempt_at|   stat|
+--------+------+-------------------+-------+
|   user2|iphone|2017-12-20 16:58:08|Success|
|   user2|iphone|2017-12-20 16:58:12|Success|
|   user2|iphone|2017-12-20 16:58:20|Success|
|   user2|iphone|2017-12-20 16:58:25|Success|
|   user2|iphone|2017-12-20 16:58:35|Success|
|   user2|iphone|2017-12-23 16:58:45|Success|
+--------+------+-------------------+-------+

// Desired Output
A grouping by (username,device) for the latest time an event occurred.

    +--------+------+-------------------+-------+-------------------+
    |username|device|         attempt_at|   stat|previous_attempt_at|
    +--------+------+-------------------+-------+-------------------+
    |   user2|iphone|2017-12-23 16:58:45|Success|               null|
    +--------+------+-------------------+-------+-------------------+

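What I have

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, last}

// Frame: rows whose event time is 1 to 3600 seconds before the current row
val w = (Window.partitionBy("username", "device")
  .orderBy(col("attempt_at").cast("timestamp").cast("long"))
  .rangeBetween(-3600, -1)
)
val df2 = df1.withColumn("previous_attempt_at", last("attempt_at").over(w))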
+--------+------+-------------------+-------+-------------------+
|username|device|         attempt_at|   stat|previous_attempt_at|
+--------+------+-------------------+-------+-------------------+
|   user2|iphone|2017-12-23 16:58:08|Success|               null|
|   user2|iphone|2017-12-23 16:58:12|Success|2017-12-23 16:58:08|
|   user2|iphone|2017-12-23 16:58:20|Success|2017-12-23 16:58:12|
|   user2|iphone|2017-12-23 16:58:25|Success|2017-12-23 16:58:20|
|   user2|iphone|2017-12-23 16:58:35|Success|2017-12-23 16:58:25|
|   user2|iphone|2017-12-23 16:58:45|Success|2017-12-23 16:58:35|
+--------+------+-------------------+-------+-------------------+

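Notes: my code computes the window for every row in a given user's group.

That is extremely inefficient on large-scale data, and it still does not give me just the latest attempt. I do not need all of the rows, only the last one.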

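All you need is an additional groupBy and aggregation. Before that, you need the collect_list function to cumulatively collect the previous dates, and a udf function to check that the previous attempt_at is within the time limit. Then convert the three columns ("attempt_at", "stat", "previous_attempt_at") into a struct so that the last one can be selected; the full code is given below.

For the later input data (earlier events on 2017-12-20, latest event on 2017-12-23) you should get the requested output, with previous_attempt_at null. For the previous input data (all events on 2017-12-23) you get 2017-12-23 16:58:35 as previous_attempt_at. Both result tables follow the code.

You can change the logic to hours by changing ChronoUnit.DAYS in the udf function to ChronoUnit.HOURS, and so on.

Thanks, but what about the time window? Suppose I want to look back 1 hour, 1 week, etc. For example, with the code above, if rows 1-5 all occurred 2-3 hours earlier, the result for row 6 would be null. If the look-back period is a week, we would get the result shown.

Thanks @Ramesh Maharjan. Please see the new edit; it is under "Exception in the desired output".

Thanks. Please explain how max(struct) picks the row with the latest timestamp from the array. It works, but it makes no sense to me. Also, could you explain how the "reverse fat arrow" in the UDF works?

The struct has three elements, and the max function picks the maximum struct. It compares the first element first; if there is a tie, it compares the second element, and so on. That is not a reverse fat arrow; it is the less-than-or-equal sign (<=).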

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions._
import java.time._
import java.time.temporal._
import java.time.format._

// Returns the most recent earlier timestamp that lies within the look-back limit, or null.
// `timestamps` is the running list for the partition; its last element is the current row's own timestamp.
val durationUdf = udf((actualTimestamp: String, timestamps: Seq[String]) => {
  val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  val actualDateTime = LocalDateTime.parse(actualTimestamp, formatter)
  // until(..., DAYS) counts whole days, so "<= 1" accepts any gap shorter than two full days.
  val diffDates = timestamps.init.filter(x =>
    LocalDateTime.parse(x, formatter).until(actualDateTime, ChronoUnit.DAYS) <= 1)
  if (diffDates.nonEmpty) diffDates.last else null
})

// Order each (username, device) partition by event time.
val w = Window.partitionBy("username", "device").orderBy(col("attempt_at").cast("timestamp").cast("long"))

val df2 = df1
  // collect_list over the ordered window accumulates the timestamps up to the current row.
  .withColumn("previous_attempt_at", durationUdf(col("attempt_at"), collect_list("attempt_at").over(w)))
  // attempt_at goes first in the struct so that max(struct) picks the latest event.
  .withColumn("struct", struct(col("attempt_at").cast("timestamp").as("attempt_at"), col("stat"), col("previous_attempt_at")))
  .groupBy("username", "device").agg(max("struct").as("struct"))
  .select(col("username"), col("device"), col("struct.attempt_at"), col("struct.stat"), col("struct.previous_attempt_at"))
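Since the UDF's time limit hinges on LocalDateTime.until with a ChronoUnit, it may help to see what that call returns for a few gaps. This is a small standalone sketch; the object UntilSketch and its helper between are hypothetical names introduced only for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object UntilSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Number of complete `unit`s between two timestamps (fractions are truncated).
  def between(from: String, to: String, unit: ChronoUnit): Long =
    LocalDateTime.parse(from, fmt).until(LocalDateTime.parse(to, fmt), unit)

  // Gap of 3 days and 10 seconds: outside a "<= 1 day" limit.
  val threeDays: Long = between("2017-12-20 16:58:35", "2017-12-23 16:58:45", ChronoUnit.DAYS) // 3

  // Gap of 1 day 23:59:59: truncated to 1, so a "<= 1 day" check still accepts it.
  val almostTwoDays: Long = between("2017-12-21 16:58:46", "2017-12-23 16:58:45", ChronoUnit.DAYS) // 1

  // The same gap measured in hours.
  val inHours: Long = between("2017-12-21 16:58:46", "2017-12-23 16:58:45", ChronoUnit.HOURS) // 47
}
```

If the look-back has to be exactly 24 hours rather than "fewer than two whole days", one option is to compare in a finer unit, for example ChronoUnit.HOURS with a limit of 24 or ChronoUnit.SECONDS with a limit of 86400.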

Result for the later input data (earlier events on 2017-12-20, latest event on 2017-12-23):

+--------+------+---------------------+-------+-------------------+
|username|device|attempt_at           |stat   |previous_attempt_at|
+--------+------+---------------------+-------+-------------------+
|user2   |iphone|2017-12-23 16:58:45.0|Success|null               |
+--------+------+---------------------+-------+-------------------+

Result for the previous input data (all events on 2017-12-23):

+--------+------+---------------------+-------+-------------------+
|username|device|attempt_at           |stat   |previous_attempt_at|
+--------+------+---------------------+-------+-------------------+
|user2   |iphone|2017-12-23 16:58:45.0|Success|2017-12-23 16:58:35|
+--------+------+---------------------+-------+-------------------+
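The two results can be cross-checked with a sketch of the per-group logic on ordinary Scala collections. This is not Spark code and not the answer's method: Event, Result and latestPerKey are hypothetical names, and the window is an exact java.time.Duration rather than the truncated ChronoUnit.DAYS count used in the UDF. It only illustrates that, for each (username, device) key, the latest event and its immediate predecessor are enough to produce the output row.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter

object LatestWithPreviousSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  final case class Event(username: String, device: String, attemptAt: String, stat: String)
  final case class Result(username: String, device: String, attemptAt: String,
                          stat: String, previousAttemptAt: Option[String])

  // For each (username, device): keep the latest event, plus the event just before it
  // when that one happened no more than `window` earlier (None otherwise).
  def latestPerKey(events: Seq[Event], window: Duration): Seq[Result] =
    events.groupBy(e => (e.username, e.device)).toSeq.map { case ((user, device), group) =>
      val sorted = group.sortBy(_.attemptAt) // this timestamp format sorts chronologically as text
      val latest = sorted.last
      val latestTime = LocalDateTime.parse(latest.attemptAt, fmt)
      val previous = sorted.init.lastOption.map(_.attemptAt).filter { ts =>
        Duration.between(LocalDateTime.parse(ts, fmt), latestTime).compareTo(window) <= 0
      }
      Result(user, device, latest.attemptAt, latest.stat, previous)
    }
}
```

With a one-day window, the data whose earlier events are on 2017-12-20 yields no previous attempt, while the data with all events on 2017-12-23 yields 2017-12-23 16:58:35, matching the two tables above.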