Scala Spark DataFrame: group by and rank the groups in order?
I have the following data:
+-------+----+----+
|user_id|time|item|
+-------+----+----+
| 1| 5| ggg|
| 1| 5| ddd|
| 1| 20| aaa|
| 1| 20| ppp|
| 2| 3| ccc|
| 2| 3| ttt|
| 2| 20| eee|
+-------+----+----+
It can be generated with the following code:
val df = sc.parallelize(Array(
  (1, 20, "aaa"),
  (1, 5, "ggg"),
  (2, 3, "ccc"),
  (1, 20, "ppp"),
  (1, 5, "ddd"),
  (2, 20, "eee"),
  (2, 3, "ttt"))).toDF("user_id", "time", "item")
How can I get the following result:
+---------+------+------+----------+
| user_id | time | item | order_id |
+---------+------+------+----------+
| 1 | 5 | ggg | 1 |
| 1 | 5 | ddd | 1 |
| 1 | 20 | aaa | 2 |
| 1 | 20 | ppp | 2 |
| 2 | 3 | ccc | 1 |
| 2 | 3 | ttt | 1 |
| 2 | 20 | eee | 2 |
+---------+------+------+----------+
That is, group by user_id and time, order by time, and rank each group within its user. Thanks!

To rank the rows you can use the dense_rank window function; the final row ordering is achieved with a trailing orderBy transformation:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.dense_rank

// Partition by user_id; ordering by user_id inside the partition is redundant,
// since every row in a partition shares the same user_id.
val w = Window.partitionBy("user_id").orderBy("time")
val result = df
  .withColumn("order_id", dense_rank().over(w))
  .orderBy("user_id", "time")
result.show()
+-------+----+----+--------+
|user_id|time|item|order_id|
+-------+----+----+--------+
| 1| 5| ddd| 1|
| 1| 5| ggg| 1|
| 1| 20| aaa| 2|
| 1| 20| ppp| 2|
| 2| 3| ttt| 1|
| 2| 3| ccc| 1|
| 2| 20| eee| 2|
+-------+----+----+--------+
Note that the order within the item column is not guaranteed.

Thank you very much! :)
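To see why dense_rank produces the order_id column above, its semantics can be sketched in plain Scala without Spark: within each partition, tied ordering keys get the same rank, and ranks are consecutive with no gaps. This is an illustrative sketch only (the object and method names are made up), not Spark's actual implementation.

```scala
// Plain-Scala sketch of dense_rank over Window.partitionBy("user_id").orderBy("time").
object DenseRankSketch {
  // Input rows: (user_id, time, item); output adds a dense rank as the 4th field.
  def denseRank(rows: Seq[(Int, Int, String)]): Seq[(Int, Int, String, Int)] =
    rows
      .groupBy(_._1)                       // partitionBy("user_id")
      .toSeq
      .flatMap { case (_, part) =>
        // Distinct ordering keys, sorted ascending: index + 1 is the dense rank,
        // so tied times share a rank and ranks have no gaps.
        val ranks = part.map(_._2).distinct.sorted.zipWithIndex.toMap
        part.map { case (u, t, item) => (u, t, item, ranks(t) + 1) }
      }
      .sortBy(r => (r._1, r._2))           // final orderBy("user_id", "time")

  def main(args: Array[String]): Unit = {
    val data = Seq(
      (1, 20, "aaa"), (1, 5, "ggg"), (2, 3, "ccc"),
      (1, 20, "ppp"), (1, 5, "ddd"), (2, 20, "eee"), (2, 3, "ttt"))
    denseRank(data).foreach(println)
  }
}
```

Running `main` prints the same (user_id, time, item, order_id) tuples as the Spark result above, though the order of items within a tied (user_id, time) group is likewise unspecified.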