
Raw data

{"user_id":346214,"event":"logout","time":"2019-11-20 00:19:41"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:19:43"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:22:09"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:22:12"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:24:12"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:24:14"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:25:43"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:25:45"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:29:55"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:29:57"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:30:00"}


//create dataframe with only login events sorted by user_id, time
val leftDF = rawDataFrame.filter(col("event")===lit("login")).orderBy("user_id","time")
leftDF.show()

//create dataframe with only logout events sorted by user_id, time
val rightDF = rawDataFrame.filter(col("event")===lit("logout")).orderBy("user_id","time")
rightDF.show()

// join left and right dataframes such that the logoutDF row time is greater than or equal to the loginDF row time.
val joinedDF = leftDF.as("loginDF")
  .join(rightDF.as("logoutDF"),
    col("logoutDF.time") >= col("loginDF.time")
      &&
      col("loginDF.user_id") === col("logoutDF.user_id"),"left")
  .orderBy("loginDF.user_id","loginDF.time","logoutDF.time")
  .groupBy(col("loginDF.user_id").as("user_id"),col("loginDF.time").as("login"))
  .agg(first("logoutDF.time").as("logout"))
  .orderBy("user_id","login","logout")
// this will create data like below; now we have to remove the overlap from it

{"user_id":346214,"login":"2019-11-20 00:25:45","logout":"2019-11-20 00:29:55","group_id":4,"updated_login":"2019-11-20 00:25:45","update_logout":"2019-11-20 00:29:55","session_time":250}
{"user_id":346214,"login":"2019-11-20 00:24:14","logout":"2019-11-20 00:25:43","group_id":3,"updated_login":"2019-11-20 00:24:14","update_logout":"2019-11-20 00:25:43","session_time":89}
{"user_id":346214,"login":"2019-11-20 00:29:57","logout":"2019-11-20 00:30:00","group_id":5,"updated_login":"2019-11-20 00:29:57","update_logout":"2019-11-20 00:30:00","session_time":3}
{"user_id":346214,"login":"2019-11-20 00:22:12","logout":"2019-11-20 00:24:12","group_id":2,"updated_login":"2019-11-20 00:22:12","update_logout":"2019-11-20 00:24:12","session_time":120}
{"user_id":346214,"login":"2019-11-20 00:19:43","logout":"2019-11-20 00:22:09","group_id":1,"updated_login":"2019-11-20 00:19:43","update_logout":"2019-11-20 00:22:09","session_time":146}

// to remove the overlap, I followed this post
https://stackoverflow.com/questions/52877237/in-spark-scala-how-to-check-overlapping-dates-from-adjacent-rows-in-a-dataframe/52881823

val win1 = Window.partitionBy(col("user_id")).orderBy(col("login"), col("logout"))
val win2 = Window.partitionBy(col("user_id"), col("group_id"))
val finalDF = joinedDF.
  withColumn("group_id", when(
    col("login").between(lag(col("login"), 1).over(win1), lag(col("logout"), 1).over(win1)), null
  ).otherwise(monotonically_increasing_id)
  ).
  withColumn("group_id", last(col("group_id"), ignoreNulls=true).
    over(win1.rowsBetween(Window.unboundedPreceding, 0))
  ).
  withColumn("updated_login", min(col("login")).over(win2)).
  withColumn("update_logout", max(col("logout")).over(win2)).
  orderBy("user_id", "login", "logout")
  .dropDuplicates(Seq("user_id","updated_login", "update_logout"))
  .withColumn("session_time", unix_timestamp(col("update_logout")) - unix_timestamp(col("updated_login")))

val result = finalDF.groupBy("user_id").agg((sum("session_time")/60) as "logged_in_min").filter(col("logged_in_min").isNotNull)
result.coalesce(1).write.format("json").mode(SaveMode.Overwrite).save("../final_result.json")

// this will generate the data below
{"user_id":346214,"logged_in_min":10.133333333333333}
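
Sanity check against the five sessions shown above: 146 + 120 + 89 + 250 + 3 = 608 seconds, and 608 / 60 ≈ 10.133 minutes, which matches the logged_in_min value.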

I think it is really hard to identify which logout belongs to which login. If there were a unique session id, it would be much easier.

Yes, you are right, a session identifier would help pin down the exact login, but our raw data does not have one. So to work around that we decided to treat the last "login" before the first "logout" as the effective login. I added an updated example to the question description. I have to clean the data somehow using that assumption and create another dataframe containing only clean, consecutive login and logout events.

That is again an interesting approach, but apologies once more, I don't think I explained the problem correctly: we have to find the windows for each user and then the sum of those windows. I updated the question above under the heading "Update 2". I am thinking of transforming the raw data as in the post below and then using your earlier approach to get the final result. What do you say?

Let me check, because I am worried this will fail for data like the following: {"user_id":671041,"event":"logout","time":"2019-11-20 00:18:16"} {"user_id":671041,"event":"login","time":"2019-11-20 00:18:36"} {"user_id":671041,"event":"logout","time":"2019-11-20 00:18:45"} {"user_id":671041,"event":"logout","time":"2019-11-20 00:30:00"}. Aggregating with the code above would report a login time of "2019-11-20 00:18:16" for user 671041, whereas the result should be {"login_time":"2019-11-20 00:18:36","logout_time":"2019-11-20 00:18:45","user_id":671041,"session_time":9}. Please correct me if I am wrong.

I get your scenario. Should a user only be considered when they have a valid session (a login and a logout)? E.g. 992210 would not appear in the result dataframe because we only have a logout event for that user.

Yes, that is the correct assumption. In any case, could you review the solution I updated in the question description and let me know if anything is wrong or could be optimized further?

I have updated my answer and attached the physical plan as well; I think you will find it more efficient.
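
To make the "last login before the first logout" idea from this discussion concrete, here is a minimal sketch (not from the original thread) of one way to pre-clean the events: per user, in time order, keep only the last login of a run of consecutive logins and the first logout of a run of consecutive logouts. It assumes the raw events are in `rawDataFrame`, as loaded below; the helper columns `prev_event`/`next_event` are made up for illustration.

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// time-ordered stream per user
val byTime = Window.partitionBy("user_id").orderBy("time")

val cleanedDF = rawDataFrame
  .withColumn("prev_event", lag(col("event"), 1).over(byTime))
  .withColumn("next_event", lead(col("event"), 1).over(byTime))
  // keep a login only if it is the last of its run (next event is not another login),
  // and a logout only if it is the first of its run (previous event is not another logout)
  .filter(
    (col("event") === "login"  && (col("next_event").isNull || col("next_event") =!= "login")) ||
    (col("event") === "logout" && (col("prev_event").isNull || col("prev_event") =!= "logout"))
  )
  .drop("prev_event", "next_event")

A leading logout with no preceding login (e.g. user 992210) still survives this filter; the downstream pairing, which only counts a login followed by a logout, simply ignores it.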
val rawDataFrame = sparkSession.read.option("multiline", "true").json(cleanJsonLines)
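
// expected output described in the question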
{"user_id": 978699, "logged_in_sec":8} // (2019-11-20 00:14:47 - 2019-11-20 00:14:46) + (2019-11-20 00:14:57 - 2019-11-20 00:14:50)
{"user_id": 992210, "logged_in_sec":0}
{"user_id": 823323, "logged_in_sec":1}
{"event" : "login","time" : "2019-11-20 00:14:46","user_id" : 978699}
{"event" : "logout","time" : "2019-11-20 00:14:46","user_id" : 992210}
{"event" : "login","time" : "2019-11-20 00:14:46","user_id" : 823323}
{"event" : "logout","time" : "2019-11-20 00:14:47","user_id" : 978699}
{"event" : "logout","time" : "2019-11-20 00:14:47","user_id" : 823323}
{"event" : "login","time" : "2019-11-20 00:14:50","user_id" : 978699}
{"event" : "logout","time" : "2019-11-20 00:14:55","user_id" : 978699}
{"event" : "logout","time" : "2019-11-20 00:14:56","user_id" : 978699}    
{"event" : "logout","time" : "2019-11-20 00:14:57","user_id" : 978699}
login 
login
login <- window start
logout <- window end
logout
login <- window start
logout <- window end
login <- window start
logout <- window end
logout
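
// updated dataset (clean, alternating login/logout events for user 978699):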
{"event" : "login","time" : "2019-11-20 00:14:46","user_id" : 978699}
{"event" : "logout","time" : "2019-11-20 00:14:47","user_id" : 978699}
{"event" : "login","time" : "2019-11-20 00:14:50","user_id" : 978699}
{"event" : "logout","time" : "2019-11-20 00:14:55","user_id" : 978699}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." column syntax; assumes a SparkSession named `spark`, as used elsewhere in this thread

val idPartition = Window.partitionBy("user_id").orderBy("time")

// `df` is the raw events dataframe (event, time, user_id).
// Running count of logins and logouts per user in time order: the k-th login row
// gets login_index = k and the k-th logout row gets logout_index = k, so equal
// indexes pair each login with its matching logout in the join below.
val df2 = df.withColumn("login_index", sum(when($"event" === "login", 1)).over(idPartition))
            .withColumn("logout_index", sum(when($"event" === "logout", 1)).over(idPartition))

df2.show(false)



val login = df2.where($"event" === "login")
                .withColumnRenamed("time", "login_time")
                .drop("logout_index")

login.show(false)

val logout = df2.where($"event" === "logout")
                .withColumnRenamed("time", "logout_time")
                .drop("login_index")

logout.show(false)

val finaldf = login.as("a").join(logout.as("b"), $"a.login_index" === $"b.logout_index"  && $"a.user_id" === $"b.user_id", "inner")
                .withColumn("session_time", unix_timestamp($"b.logout_time") - unix_timestamp($"a.login_time"))
                .select("a.login_time", "b.logout_time", "a.user_id", "a.login_index", "b.logout_index", "session_time")

finaldf.show(false)

val result = finaldf.groupBy("user_id").agg(sum("session_time") as "logged_in_sec")

result.show(false)
+------+-------------------+-------+-----------+------------+
|event |time               |user_id|login_index|logout_index|
+------+-------------------+-------+-----------+------------+
|logout|2019-11-20 00:14:46|992210 |null       |1           |
|login |2019-11-20 00:14:46|978699 |1          |null        |
|logout|2019-11-20 00:14:47|978699 |1          |1           |
|login |2019-11-20 00:14:50|978699 |2          |1           |
|logout|2019-11-20 00:14:55|978699 |2          |2           |
|logout|2019-11-20 00:14:56|978699 |2          |3           |
|logout|2019-11-20 00:14:57|978699 |2          |4           |
|login |2019-11-20 00:14:46|823323 |1          |null        |
|logout|2019-11-20 00:14:47|823323 |1          |1           |
+------+-------------------+-------+-----------+------------+

+-----+-------------------+-------+-----------+
|event|login_time         |user_id|login_index|
+-----+-------------------+-------+-----------+
|login|2019-11-20 00:14:46|978699 |1          |
|login|2019-11-20 00:14:50|978699 |2          |
|login|2019-11-20 00:14:46|823323 |1          |
+-----+-------------------+-------+-----------+

+------+-------------------+-------+------------+
|event |logout_time        |user_id|logout_index|
+------+-------------------+-------+------------+
|logout|2019-11-20 00:14:46|992210 |1           |
|logout|2019-11-20 00:14:47|978699 |1           |
|logout|2019-11-20 00:14:55|978699 |2           |
|logout|2019-11-20 00:14:56|978699 |3           |
|logout|2019-11-20 00:14:57|978699 |4           |
|logout|2019-11-20 00:14:47|823323 |1           |
+------+-------------------+-------+------------+

+-------------------+-------------------+-------+-----------+------------+------------+
|login_time         |logout_time        |user_id|login_index|logout_index|session_time|
+-------------------+-------------------+-------+-----------+------------+------------+
|2019-11-20 00:14:46|2019-11-20 00:14:47|978699 |1          |1           |1           |
|2019-11-20 00:14:50|2019-11-20 00:14:55|978699 |2          |2           |5           |
|2019-11-20 00:14:46|2019-11-20 00:14:47|823323 |1          |1           |1           |
+-------------------+-------------------+-------+-----------+------------+------------+

+-------+-------------+
|user_id|logged_in_sec|
+-------+-------------+
|978699 |6            |
|823323 |1            |
+-------+-------------+
**source extraction**
val rawDataFrame = spark.read.format("json").load("../cleanjsonLines.json")
rawDataFrame.printSchema

root
 |-- event: string (nullable = true)
 |-- time: string (nullable = true)
 |-- user_id: long (nullable = true)

**casting to timestamp**
val tfDataFrame = rawDataFrame.selectExpr("event","to_timestamp(time) as time","user_id")
tfDataFrame.printSchema

root
 |-- event: string (nullable = true)
 |-- time: timestamp (nullable = true)
 |-- user_id: long (nullable = true)

**creating temp view**
tfDataFrame.createOrReplaceTempView("SysEvent")

**Creating windowed temp view for valid sessions**
spark.sql("select * from (select *,lag(event,-1) over (partition by user_id  order by time) as next_event, lag(time,-1) over (partition by user_id order by time) as next_time from SysEvent) a where event = 'login' and next_event = 'logout' order by user_id,time").createOrReplaceTempView("WindowSysEvent")
spark.sql("select * from WindowSysEvent").show()

Result for source dataset:    
+-----+-------------------+-------+----------+-------------------+
|event|               time|user_id|next_event|          next_time|
+-----+-------------------+-------+----------+-------------------+
|login|2019-11-20 00:14:46| 823323|    logout|2019-11-20 00:14:47|
|login|2019-11-20 00:14:46| 978699|    logout|2019-11-20 00:14:47|
|login|2019-11-20 00:14:50| 978699|    logout|2019-11-20 00:14:57|
+-----+-------------------+-------+----------+-------------------+

Result for updated dataset:
+-----+-------------------+-------+----------+-------------------+
|event|               time|user_id|next_event|          next_time|
+-----+-------------------+-------+----------+-------------------+
|login|2019-11-20 00:14:46| 823323|    logout|2019-11-20 00:14:47|
|login|2019-11-20 00:14:46| 978699|    logout|2019-11-20 00:14:47|
|login|2019-11-20 00:14:50| 978699|    logout|2019-11-20 00:14:55|
+-----+-------------------+-------+----------+-------------------+
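
For comparison, here is a rough DataFrame-API sketch of the same windowed view (not part of the original answer); `lead(col, 1)` plays the same role as `lag(col, -1)` in the SQL above:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val w = Window.partitionBy("user_id").orderBy("time")

val windowSysEventDF = tfDataFrame
  .withColumn("next_event", lead(col("event"), 1).over(w))
  .withColumn("next_time",  lead(col("time"),  1).over(w))
  // keep only logins that are immediately followed by a logout
  .filter(col("event") === "login" && col("next_event") === "logout")
  .orderBy("user_id", "time")
// the aggregation below could equally run on this frame instead of the SQL view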

**aggregation for valid sessions**
val result = spark.sql("select user_id, sum(unix_timestamp(next_time) - unix_timestamp(time)) as logged_in_sec from windowSysEvent group by user_id")
result.show()

Result for source dataset:
+-------+-------------+
|user_id|logged_in_sec|
+-------+-------------+
| 978699|            8|
| 823323|            1|
+-------+-------------+

Result for updated dataset:
+-------+-------------+
|user_id|logged_in_sec|
+-------+-------------+
| 978699|            6|
| 823323|            1|
+-------+-------------+

**write to target**
result.coalesce(1).write.format("json").save("../result.json")

Result for source dataset:
{"user_id":823323,"logged_in_sec":1}
{"user_id":978699,"logged_in_sec":8}

Result for updated dataset:
{"user_id":823323,"logged_in_sec":1}
{"user_id":978699,"logged_in_sec":6}

result.explain
== Physical Plan ==
*(5) HashAggregate(keys=[user_id#184L], functions=[sum((unix_timestamp(next_time#898, yyyy-MM-dd HH:mm:ss, Some(..)) - unix_timestamp(time#188, yyyy-MM-dd HH:mm:ss, Some(..))))])
+- Exchange hashpartitioning(user_id#184L, 200)
   +- *(4) HashAggregate(keys=[user_id#184L], functions=[partial_sum((unix_timestamp(next_time#898, yyyy-MM-dd HH:mm:ss, Some(..)) - unix_timestamp(time#188, yyyy-MM-dd HH:mm:ss, Some(..))))])
      +- *(4) Sort [user_id#184L ASC NULLS FIRST, time#188 ASC NULLS FIRST], true, 0
         +- Exchange rangepartitioning(user_id#184L ASC NULLS FIRST, time#188 ASC NULLS FIRST, 200)
            +- *(3) Project [time#188, user_id#184L, next_time#898]
               +- *(3) Filter (((isnotnull(event#182) && isnotnull(next_event#897)) && (event#182 = login)) && (next_event#897 = logout))
                  +- Window [lag(event#182, -1, null) windowspecdefinition(user_id#184L, time#188 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS next_event#897, lag(time#188, -1, null) windowspecdefinition(user_id#184L, time#188 ASC NULLS FIRST, specifiedwindowframe(RowFrame, 1, 1)) AS next_time#898], [user_id#184L], [time#188 ASC NULLS FIRST]
                     +- *(2) Sort [user_id#184L ASC NULLS FIRST, time#188 ASC NULLS FIRST], false, 0
                        +- Exchange hashpartitioning(user_id#184L, 200)
                           +- *(1) Project [event#182, cast(time#183 as timestamp) AS time#188, user_id#184L]
                              +- *(1) FileScan json [event#182,time#183,user_id#184L] Batched: false, Format: JSON, Location: InMemoryFileIndex[file:../cleanjsonLines.json], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<event:string,time:string,user_id:bigint>
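
Reading the plan above: the events are shuffled once by user_id for the Window operator, range-partitioned once for the order by, and hash-partitioned once more for the final per-user aggregation; unlike the join-based attempt in the question, no join appears in the plan.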
{"user_id":346214,"event":"logout","time":"2019-11-20 00:19:41"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:19:43"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:22:09"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:22:12"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:24:12"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:24:14"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:25:43"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:25:45"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:29:55"}
{"user_id":346214,"event":"login","time":"2019-11-20 00:29:57"}
{"user_id":346214,"event":"logout","time":"2019-11-20 00:30:00"}


//create dataframe with only login events sorted by user_id, time
val leftDF = rawDataFrame.filter(col("event")===lit("login")).orderBy("user_id","time")
leftDF.show()

//create dataframe with only logout events sorted by user_id, time
val rightDF = rawDataFrame.filter(col("event")===lit("logout")).orderBy("user_id","time")
rightDF.show()

// join left and right dataframe such that logoutDF row time is greater that loginDF row time.
val joinedDF = leftDF.as("loginDF")
  .join(rightDF.as("logoutDF"),
    col("logoutDF.time") >= col("loginDF.time")
      &&
      col("loginDF.user_id") === col("logoutDF.user_id"),"left")
  .orderBy("loginDF.user_id","loginDF.time","logoutDF.time")
  .groupBy(col("loginDF.user_id").as("user_id"),col("loginDF.time").as("login"))
  .agg(first("logoutDF.time").as("logout"))
  .orderBy("user_id","login","logout")
// this will create data like below, now we have to remove the overlap from below data

{"user_id":346214,"login":"2019-11-20 00:25:45","logout":"2019-11-20 00:29:55","group_id":4,"updated_login":"2019-11-20 00:25:45","update_logout":"2019-11-20 00:29:55","session_time":250}
{"user_id":346214,"login":"2019-11-20 00:24:14","logout":"2019-11-20 00:25:43","group_id":3,"updated_login":"2019-11-20 00:24:14","update_logout":"2019-11-20 00:25:43","session_time":89}
{"user_id":346214,"login":"2019-11-20 00:29:57","logout":"2019-11-20 00:30:00","group_id":5,"updated_login":"2019-11-20 00:29:57","update_logout":"2019-11-20 00:30:00","session_time":3}
{"user_id":346214,"login":"2019-11-20 00:22:12","logout":"2019-11-20 00:24:12","group_id":2,"updated_login":"2019-11-20 00:22:12","update_logout":"2019-11-20 00:24:12","session_time":120}
{"user_id":346214,"login":"2019-11-20 00:19:43","logout":"2019-11-20 00:22:09","group_id":1,"updated_login":"2019-11-20 00:19:43","update_logout":"2019-11-20 00:22:09","session_time":146}

// to remove the overlap, I followed this post
https://stackoverflow.com/questions/52877237/in-spark-scala-how-to-check-overlapping-dates-from-adjacent-rows-in-a-dataframe/52881823

val win1 = Window.partitionBy(col("user_id")).orderBy(col("login"), col("logout"))
val win2 = Window.partitionBy(col("user_id"), col("group_id"))
val finalDF = joinedDF.
  withColumn("group_id", when(
    col("login").between(lag(col("login"), 1).over(win1), lag(col("logout"), 1).over(win1)), null
  ).otherwise(monotonically_increasing_id)
  ).
  withColumn("group_id", last(col("group_id"), ignoreNulls=true).
    over(win1.rowsBetween(Window.unboundedPreceding, 0))
  ).
  withColumn("updated_login", min(col("login")).over(win2)).
  withColumn("update_logout", max(col("logout")).over(win2)).
  orderBy("user_id", "login", "logout")
  .dropDuplicates(Seq("user_id","updated_login", "updated_logout"))
  .withColumn("session_time", unix_timestamp(col("updated_logout")) - unix_timestamp(col("updated_login")))

//this will generate below data
{"user_id":346214,"logged_in_min":10.133333333333333}
val result = finalDF.groupBy("user_id").agg((sum("session_time")/60) as "logged_in_min").filter(col("logged_in_min").isNotNull)
result.coalesce(1).write.format("json").mode(SaveMode.Overwrite).save("../final_result.json")