Scala: generating a session id for clickstream data based on conditions in Apache Spark
How can we generate a unique session id for clickstream data using Spark (Scala) DataFrames, under the following two conditions?

A session expires after 30 minutes of inactivity (that is, no clickstream data arrives within 30 minutes).

A session stays active for a total duration of at most 2 hours; after 2 hours, the session is renewed.

Input:
UserId | Click Time
-----------------------------
U1 | 2019-01-01T11:00:00Z
U1 | 2019-01-01T11:15:00Z
U1 | 2019-01-01T12:00:00Z
U1 | 2019-01-01T12:20:00Z
U1 | 2019-01-01T15:00:00Z
U2 | 2019-01-01T11:00:00Z
U2 | 2019-01-02T11:00:00Z
U2 | 2019-01-02T11:25:00Z
U2 | 2019-01-02T11:50:00Z
U2 | 2019-01-02T12:15:00Z
U2 | 2019-01-02T12:40:00Z
U2 | 2019-01-02T13:05:00Z
U2 | 2019-01-02T13:20:00Z
Expected output:
UserId | Click Time | SessionId
-----------------------------------------
U1 | 2019-01-01T11:00:00Z | Session1
U1 | 2019-01-01T11:15:00Z | Session1
U1 | 2019-01-01T12:00:00Z | Session2
U1 | 2019-01-01T12:20:00Z | Session2
U1 | 2019-01-01T15:00:00Z | Session3
U2 | 2019-01-01T11:00:00Z | Session4
U2 | 2019-01-02T11:00:00Z | Session5
U2 | 2019-01-02T11:25:00Z | Session5
U2 | 2019-01-02T11:50:00Z | Session5
U2 | 2019-01-02T12:15:00Z | Session5
U2 | 2019-01-02T12:40:00Z | Session5
U2 | 2019-01-02T13:05:00Z | Session6
U2 | 2019-01-02T13:20:00Z | Session6
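As a sanity check on the expected output above, the two rules can be prototyped in plain Scala (no Spark) with a single pass over each user's sorted click times. `SessionizeDemo.sessionize` below is a hypothetical reference helper introduced here for illustration; it is not part of any Spark API:

```scala
import java.time.Instant

object SessionizeDemo {
  // Assign a 0-based session index to each click, applying both rules:
  //  - rule 1: a gap of more than 30 minutes since the previous click starts a new session
  //  - rule 2: once 2 hours have elapsed since the session's first click, the session is renewed
  def sessionize(clicks: Seq[Instant]): Seq[Int] = {
    val sorted = clicks.sortBy(_.getEpochSecond)
    var session = 0
    var sessionStart = 0L
    var prev = Long.MinValue
    sorted.map { t =>
      val ts = t.getEpochSecond
      if (prev == Long.MinValue) {
        sessionStart = ts                               // first click opens session 0
      } else if (ts - prev > 30 * 60 || ts - sessionStart >= 2 * 60 * 60) {
        session += 1                                    // either rule fired: open a new session
        sessionStart = ts
      }
      prev = ts
      session
    }
  }
}
```

For U1's five clicks this yields `Seq(0, 0, 1, 1, 2)`, matching Session1 through Session3 in the table above.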
I was not able to apply the second condition, because it requires accumulating the session duration and checking whether the cumulative value is still under 2 hours; once the cumulative duration exceeds 2 hours, a new session has to be assigned.
Please help me with this.

Thanks in advance.

val df = spark.read.option("header", true).option("delimiter", "|").csv("dataset")
  .select(col("user_id"), col("click_time").cast("timestamp"))

df
  .withColumn("lag_click_time", lag("click_time", 1).over(Window.partitionBy("user_id").orderBy("click_time")))
  .withColumn("time_diff", (col("click_time").cast("long") - col("lag_click_time").cast("long")) / (60 * 30))
  .na.fill(0)
  .withColumn("is_new_session", when(col("time_diff") > 1, 1).otherwise(0))
  .withColumn("temp_session_id", sum(col("is_new_session")).over(Window.partitionBy("user_id").orderBy("click_time")))
  .withColumn("first_click_time", first(col("click_time")).over(Window.partitionBy("user_id", "temp_session_id").orderBy("click_time")))
  .withColumn("time_diff2", ((col("click_time").cast("long") - col("first_click_time").cast("long")) / (60 * 60 * 2)).cast("int"))
  .withColumn("session_id", col("time_diff2") + col("temp_session_id"))
  .drop("lag_click_time", "time_diff", "is_new_session", "temp_session_id", "first_click_time", "time_diff2")
  .show()
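The second condition in the pipeline above hinges on integer division of the elapsed seconds since the session's first click (the time_diff2 column): every full 2-hour block bumps the session id by one. That arithmetic can be checked in isolation with plain Scala; `renewals` is a hypothetical helper named here for illustration:

```scala
import java.time.Instant

object TwoHourCapCheck {
  // time_diff2 = floor((click_time - first_click_time) / 7200):
  // each full 2-hour block since the session's first click adds 1 to the session id.
  def renewals(first: Instant, click: Instant): Long =
    (click.getEpochSecond - first.getEpochSecond) / (60 * 60 * 2)

  def main(args: Array[String]): Unit = {
    val first = Instant.parse("2019-01-02T11:00:00Z")
    // 12:40 is 6000 s after the first click -> still the same session
    println(renewals(first, Instant.parse("2019-01-02T12:40:00Z"))) // 0
    // 13:05 is 7500 s after the first click -> one renewal, new session id
    println(renewals(first, Instant.parse("2019-01-02T13:05:00Z"))) // 1
  }
}
```

For U2's second day, the renewal count flips from 0 to 1 between 12:40 (6000 s elapsed) and 13:05 (7500 s elapsed), which is exactly where the expected output switches from Session5 to Session6.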
Please add a proper explanation of the provided solution, along with some comments. It has already been answered here:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

// Build a sample DataFrame and cast the click-time string to a timestamp
val df = sc.parallelize(List(
    ("U1","2019-01-01T11:00:00Z"), ("U1","2019-01-01T11:15:00Z"), ("U1","2019-01-01T12:00:00Z"),
    ("U1","2019-01-01T12:20:00Z"), ("U1","2019-01-01T15:00:00Z"), ("U2","2019-01-01T11:00:00Z"),
    ("U2","2019-01-02T11:00:00Z"), ("U2","2019-01-02T11:25:00Z"), ("U2","2019-01-02T11:50:00Z"),
    ("U2","2019-01-02T12:15:00Z"), ("U2","2019-01-02T12:40:00Z"), ("U2","2019-01-02T13:05:00Z"),
    ("U2","2019-01-02T13:20:00Z")))
  .toDF("UserId", "time")
  .withColumn("ClickTime", col("time").cast("timestamp"))
  .drop(col("time"))

// Per-user window ordered by click time; lag fetches the previous click
val windowSpec = Window.partitionBy("userid").orderBy("clicktime")
val lagWindow = lag(col("clicktime"), 1).over(windowSpec)

// Seconds elapsed since the previous click (0 for each user's first click)
val df1 = df.select(col("userid"), col("clicktime"), lagWindow.alias("prevclicktime"))
  .withColumn("timediff", col("clicktime").cast("long") - col("prevclicktime").cast("long"))
  .na.fill(Map("timediff" -> 0))
  .drop(col("prevclicktime"))
df1.show(truncate = false)
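The snippet above stops after computing timediff, so neither condition has been applied yet. One possible continuation, sketched here as untested Spark code mirroring the approach in the question's own attempt (the names isnewsession, tempsession, firstclick, renewals, and sessions are introduced purely for illustration):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, first, sum, when}

val byUser = Window.partitionBy("userid").orderBy("clicktime")

val sessions = df1
  // Condition 1: a gap of more than 30 minutes since the previous click starts a new session
  .withColumn("isnewsession", when(col("timediff") > 30 * 60, 1).otherwise(0))
  // Running sum of the flags gives a provisional per-user session number
  .withColumn("tempsession", sum(col("isnewsession")).over(byUser))
  // Condition 2: every full 2 hours elapsed since the session's first click renews it
  .withColumn("firstclick",
    first(col("clicktime")).over(Window.partitionBy("userid", "tempsession").orderBy("clicktime")))
  .withColumn("renewals",
    ((col("clicktime").cast("long") - col("firstclick").cast("long")) / (60 * 60 * 2)).cast("int"))
  .withColumn("sessionid", col("tempsession") + col("renewals"))
  .drop("isnewsession", "firstclick", "renewals")

sessions.show(truncate = false)
```

Note that sessionid is only unique per user; pairing it with userid makes it globally unique. Also, the renewal counter is measured from the session's original first click, so renewal boundaries fall at fixed 2-hour multiples rather than restarting at each renewal; this matches the question's attempt and the expected output above.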