
Spark (Scala): computing user session length from consecutive active hours in a DataFrame


I'd like to perform the following transformation. Given a DataFrame that records, hour by hour, whether a user was active, and treating each run of consecutive active hours as a session, I want to accumulate the number of hours within each session.

For example, the original DataFrame looks like this:

scala> val df = sc.parallelize(List(
  ("user1",0,true),
  ("user1",1,true),
  ("user1",2,false),
  ("user1",3,true),
  ("user1",4,false),
  ("user1",5,false),
  ("user1",6,true),
  ("user1",7,true),
  ("user1",8,true)
)).toDF("user_id","hour_of_day","is_active")
df: org.apache.spark.sql.DataFrame = [user_id: string, hour_of_day: int, is_active: boolean]
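The table below is presumably what a call like the following prints in spark-shell (an assumption on my part; truncation is disabled so the values are not cut off):

df.show(false)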

  +-------+-----------+---------+
  |user_id|hour_of_day|is_active|
  +-------+-----------+---------+
  |user1  |0          |true     |
  |user1  |1          |true     |
  |user1  |2          |false    |
  |user1  |3          |true     |
  |user1  |4          |false    |
  |user1  |5          |false    |
  |user1  |6          |true     |
  |user1  |7          |true     |
  |user1  |8          |true     |
  +-------+-----------+---------+
I want to add two columns that track the hour at which the session began and the session length. Deriving either one would let me compute the other, so either will do (see the sketch after the expected output below).

The expected output looks like this:

  +-------+-----------+---------+------------------+--------------+
  |user_id|hour_of_day|is_active|session_begin_hour|session_length|
  +-------+-----------+---------+------------------+--------------+
  |user1  |0          |true     |0                 |1             |
  |user1  |1          |true     |0                 |2             |
  |user1  |2          |false    |null              |0             |
  |user1  |3          |true     |3                 |1             |
  |user1  |4          |false    |null              |0             |
  |user1  |5          |false    |null              |0             |
  |user1  |6          |true     |6                 |1             |
  |user1  |7          |true     |6                 |2             |
  |user1  |8          |true     |6                 |3             |
  +-------+-----------+---------+------------------+--------------+
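To make that equivalence concrete, here is a minimal sketch (mine, not part of the original question) of how session_length would follow from session_begin_hour if that column were already available on a hypothetical dfWithBegin, with inactive hours mapped to 0:

import org.apache.spark.sql.functions.when

// dfWithBegin is hypothetical: the original columns plus session_begin_hour.
val withLength = dfWithBegin.withColumn(
  "session_length",
  when($"is_active", $"hour_of_day" - $"session_begin_hour" + 1).otherwise(0)
)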
I tried using a WindowSpec with lag to look back one row, but since the column I need doesn't exist in the original DF, there is no previous value to compute it from.

Is there an elegant solution to this, preferably in Scala?


Thanks in advance.

First, let's determine whether a given record marks the start of a session:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val userWindow = Window.partitionBy($"user_id").orderBy($"hour_of_day")
val prevActive = lag($"is_active", 1).over(userWindow)                   // activity flag from the previous hour
val newSession = $"is_active" && (prevActive.isNull || not(prevActive))  // active now, but not in the previous hour

val withInd = df.withColumn("new_session", newSession)

// +-------+-----------+---------+-----------+   
// |user_id|hour_of_day|is_active|new_session|
// +-------+-----------+---------+-----------+
// |  user1|          0|     true|       true|
// |  user1|          1|     true|      false|
// |  user1|          2|    false|      false|
// |  user1|          3|     true|       true|
// |  user1|          4|    false|      false|
// |  user1|          5|    false|      false|
// |  user1|          6|     true|       true|
// |  user1|          7|     true|      false|
// |  user1|          8|     true|      false|
// +-------+-----------+---------+-----------+
Next, let's generate session ids:

// A running count of session starts yields the session id; inactive hours stay null.
val session = when(
  $"is_active",
  sum($"new_session".cast("long")).over(userWindow)
)

val withSession = withInd.withColumn("session", session)

// +-------+-----------+---------+-----------+-------+
// |user_id|hour_of_day|is_active|new_session|session|
// +-------+-----------+---------+-----------+-------+
// |  user1|          0|     true|       true|      1|
// |  user1|          1|     true|      false|      1|
// |  user1|          2|    false|      false|   null|
// |  user1|          3|     true|       true|      2|
// |  user1|          4|    false|      false|   null|
// |  user1|          5|    false|      false|   null|
// |  user1|          6|     true|       true|      3|
// |  user1|          7|     true|      false|      3|
// |  user1|          8|     true|      false|      3|
// +-------+-----------+---------+-----------+-------+
Finally, let's create a new window and compute the values of interest:

// Same ordering as before, but partitioned by (user, session) instead of just user.
val userSessionWindow = userWindow.partitionBy($"user_id", $"session")

// First active hour within the session.
val sessionBeginHour = when(
  $"is_active",
  min($"hour_of_day").over(userSessionWindow)
)

// Hours elapsed since the session began; inactive hours get 0.
val sessionLength = when(
  $"is_active",
  $"hour_of_day" + 1 - sessionBeginHour
).otherwise(0)

val result = withSession
  .withColumn("session_begin_hour", sessionBeginHour)
  .withColumn("session_length", sessionLength)
  .drop("new_session")
  .drop("session")

result.orderBy($"hour_of_day").show
// +-------+-----------+---------+------------------+--------------+
// |user_id|hour_of_day|is_active|session_begin_hour|session_length|
// +-------+-----------+---------+------------------+--------------+
// |  user1|          0|     true|                 0|             1|
// |  user1|          1|     true|                 0|             2|
// |  user1|          2|    false|              null|             0|
// |  user1|          3|     true|                 3|             1|
// |  user1|          4|    false|              null|             0|
// |  user1|          5|    false|              null|             0|
// |  user1|          6|     true|                 6|             1|
// |  user1|          7|     true|                 6|             2|
// |  user1|          8|     true|                 6|             3|
// +-------+-----------+---------+------------------+--------------+
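As a quick sanity check (not part of the original answer; the second user's rows are made up for illustration), the same column expressions can be reused on a DataFrame containing several users, since partitionBy($"user_id") keeps each user's sessions independent:

val df2 = sc.parallelize(List(
  ("user1", 0, true),  ("user1", 1, true),  ("user1", 2, false),
  ("user2", 0, false), ("user2", 1, true),  ("user2", 2, true)
)).toDF("user_id", "hour_of_day", "is_active")

val result2 = df2
  .withColumn("new_session", newSession)              // reuse the column expressions defined above
  .withColumn("session", session)
  .withColumn("session_begin_hour", sessionBeginHour)
  .withColumn("session_length", sessionLength)
  .drop("new_session", "session")

result2.orderBy($"user_id", $"hour_of_day").show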