Select top n rows in each group of a DataFrame in Spark Scala

Tags: scala, apache-spark, apache-spark-sql, spark-streaming

I have a DataFrame in Spark Scala. I need to group it by id, sort each group by DateTime (the second column), and take only the first 5 rows of each group.

------------------------------------
|             id|           DateTime|
------------------------------------
|340054675199675|15-01-2018 19:43:23|
|340054675199675|15-01-2018 10:56:43|
|340028465709212|10-01-2018 02:47:11|
|340054675199675|09-01-2018 10:59:10|
|340028465709212|02-01-2018 03:25:35|
|340054675199675|28-12-2017 05:48:04|
|340054675199675|21-12-2017 15:47:51|
|340028465709212|18-12-2017 10:33:04|
|340028465709212|16-12-2017 19:55:40|
|340028465709212|16-12-2017 19:55:40|
|340028465709212|12-12-2017 07:04:51|
|340054675199675|06-12-2017 08:52:38|
------------------------------------
val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" === 10).drop("rn")
val dfMax = df.groupBy($"id".as("grouped_id")).agg(first($"DateTime").as("max_value")).limit(10)
val dfTopByJoin = df.join(broadcast(dfMax),
  ($"id" === $"grouped_id") && ($"DateTime" === $"max_value"))
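For comparison, a minimal sketch of the groupBy-plus-join alternative that the snippets above are adapted from (not from the original post): it returns only the single latest row per id, so by itself it cannot give the top 5 per group. Using max instead of first and the trailing drop are my assumptions, and it presumes DateTime has already been converted to a timestamp, since max over the raw dd-MM-yyyy string would not be chronological.

import org.apache.spark.sql.functions.{broadcast, max}
// assumes spark.implicits._ is in scope (automatic in spark-shell) for the $ syntax

// Latest DateTime per id, aliased so the join below does not collide with df's columns
val dfMax = df
  .groupBy($"id".as("grouped_id"))
  .agg(max($"DateTime").as("max_value"))

// Keep only the rows whose DateTime equals the per-id maximum
val dfTopByJoin = df
  .join(broadcast(dfMax),
    ($"id" === $"grouped_id") && ($"DateTime" === $"max_value"))
  .drop("grouped_id", "max_value")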
Scala code to achieve the required output

The DataFrame column (DateTime) is in string format, so it needs to be converted to a timestamp so that the data can easily be sorted as required.
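A minimal conversion sketch, assuming the dd-MM-yyyy HH:mm:ss format shown in the sample data (sorting the raw strings would order by day-of-month first rather than chronologically):

import org.apache.spark.sql.functions.to_timestamp

// Parse the string column into a proper timestamp so orderBy sorts chronologically.
// df is the input DataFrame (called df2 in the full session further below).
val df3 = df.withColumn("DateTime", to_timestamp($"DateTime", "dd-MM-yyyy HH:mm:ss"))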

Apply a window function to retrieve the required output:

val w = Window.partitionBy("id").orderBy("DateTime")
val dfTop = df3.withColumn("rn", row_number.over(w)).filter($"rn" < 6).drop(col("rn"))

dfTop.show
+---------------+-------------------+
|             id|           DateTime|
+---------------+-------------------+
|340028465709212|2017-12-12 07:04:51|
|340028465709212|2017-12-16 19:55:40|
|340028465709212|2017-12-16 19:55:40|
|340028465709212|2017-12-18 10:33:04|
|340028465709212|2018-01-02 03:25:35|
|340054675199675|2017-12-06 08:52:38|
|340054675199675|2017-12-21 15:47:51|
|340054675199675|2017-12-28 05:48:04|
|340054675199675|2018-01-09 10:59:10|
|340054675199675|2018-01-15 10:56:43|
+---------------+-------------------+
Then you will get the answer you want. Happy Hadoop!


Comments: Are you using Spark Streaming (given the tag)? — Yes, I will be getting this data from streaming. — Is it DStream or Structured Streaming? — Structured, but right now I am more interested in sorting the above df. — Why don't you do the following: val w = Window.partitionBy("id"); val dfTop = df.withColumn("rn", row_number.over(w)).where($"rn" …
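One note on the partition-only window suggested in the comments: as far as I know, row_number in Spark requires the window to be ordered, so Window.partitionBy("id") without an orderBy fails at analysis time. A quick sketch (Spark 2.x assumed):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Partition-only window, as in the comment
val wNoOrder = Window.partitionBy("id")

// This fails with an AnalysisException along the lines of
// "Window function row_number() requires window to be ordered":
// df.withColumn("rn", row_number().over(wNoOrder))

// An ordered window spec is required
val wOrdered = Window.partitionBy("id").orderBy("DateTime")
val ranked = df.withColumn("rn", row_number().over(wOrdered))

The complete spark-shell session below puts all the steps together: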
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
scala> df2.show
+---------------+-------------------+
|             id|           DateTime|
+---------------+-------------------+
|340054675199675|15-01-2018 19:43:23|
|340054675199675|15-01-2018 10:56:43|
|340028465709212|10-01-2018 02:47:11|
|340054675199675|09-01-2018 10:59:10|
|340028465709212|02-01-2018 03:25:35|
|340054675199675|28-12-2017 05:48:04|
|340054675199675|21-12-2017 15:47:51|
|340028465709212|18-12-2017 10:33:04|
|340028465709212|16-12-2017 19:55:40|
|340028465709212|16-12-2017 19:55:40|
|340028465709212|12-12-2017 07:04:51|
|340054675199675|06-12-2017 08:52:38|
+---------------+-------------------+


scala> df2.printSchema
root
   |-- id: string (nullable = true)
   |-- DateTime: string (nullable = true)
 var df3 = df2.withColumn("DateTime", to_timestamp($"DateTime", "dd-MM-yyyy HH:mm:ss"))
 scala> df3.printSchema
 root
   |-- id: string (nullable = true)
   |-- DateTime: timestamp (nullable = true)
 val w= Window.partitionBy("id").orderBy("DateTime")
 val dfTop = df3.withColumn("rn", row_number.over(w)).filter($"rn"<6).drop(col("rn"))

 scala> dfTop.show
 +---------------+-------------------+
 |             id|           DateTime|
 +---------------+-------------------+
 |340028465709212|2017-12-12 07:04:51|
 |340028465709212|2017-12-16 19:55:40|
 |340028465709212|2017-12-16 19:55:40|
 |340028465709212|2017-12-18 10:33:04|
 |340028465709212|2018-01-02 03:25:35|
 |340054675199675|2017-12-06 08:52:38|
 |340054675199675|2017-12-21 15:47:51|
 |340054675199675|2017-12-28 05:48:04|
 |340054675199675|2018-01-09 10:59:10|
 |340054675199675|2018-01-15 10:56:43|
 +---------------+-------------------+
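If the 5 most recent rows per id are wanted instead of the 5 earliest, the same window can simply be ordered descending (a small variation, not stated in the original post):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

// Same top-5-per-group logic, but keep the latest 5 timestamps per id
val wDesc = Window.partitionBy("id").orderBy(col("DateTime").desc)
val dfTopRecent = df3.withColumn("rn", row_number().over(wDesc))
  .filter(col("rn") < 6)
  .drop("rn")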