Scala Spark DataFrame groupBy and order groups?


I have the following data,

+-------+----+----+
|user_id|time|item|
+-------+----+----+
|      1|   5| ggg|
|      1|   5| ddd|
|      1|  20| aaa|
|      1|  20| ppp|
|      2|   3| ccc|
|      2|   3| ttt|
|      2|  20| eee|
+-------+----+----+
which can be generated by the following code:

val df = sc.parallelize(Array(
  (1, 20, "aaa"),
  (1, 5, "ggg"),
  (2, 3, "ccc"),
  (1, 20, "ppp"),
  (1, 5, "ddd"),
  (2, 20, "eee"),
  (2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the following result?

+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
|       1 |    5 | ggg  |        1 |
|       1 |    5 | ddd  |        1 |
|       1 |   20 | aaa  |        2 |
|       1 |   20 | ppp  |        2 |
|       2 |    3 | ccc  |        1 |
|       2 |    3 | ttt  |        1 |
|       2 |   20 | eee  |        2 |
+---------+------+------+----------+

In other words: group by user_id and time, order by time, and rank the groups. Thanks~

To rank the rows you can use the dense_rank window function, and the final row order can be achieved with an orderBy transformation:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

// rank distinct time values within each user_id partition
// (ordering by user_id inside its own partition would be redundant)
val w = Window.partitionBy("user_id").orderBy("time")

val result = df
  .withColumn("order_id", dense_rank().over(w))
  .orderBy("user_id", "time")

result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
|      1|   5| ddd|       1|
|      1|   5| ggg|       1|
|      1|  20| aaa|       2|
|      1|  20| ppp|       2|
|      2|   3| ttt|       1|
|      2|   3| ccc|       1|
|      2|  20| eee|       2|
+-------+----+----+--------+
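The same ranking can also be expressed in Spark SQL with DENSE_RANK(); a sketch, where the temp view name "events" is just an illustrative choice:

df.createOrReplaceTempView("events")

val resultSql = spark.sql("""
  SELECT user_id, time, item,
         DENSE_RANK() OVER (PARTITION BY user_id ORDER BY time) AS order_id
  FROM events
  ORDER BY user_id, time
""")
resultSql.show()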

Note that in either version the order within the item column is not guaranteed: rows that tie on time may come back in any order.
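If you also need the item order to be reproducible, one option (going beyond what was asked) is to break ties on item in the final sort:

// break ties on "item" so repeated runs return rows in the same order
val stable = result.orderBy("user_id", "time", "item")
stable.show()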

Thank you very much~~ :)