Spark DataFrame: how to keep only the latest record in each group, based on ID and date?
I have a DataFrame (DF). How do I keep only the latest record in each group? (There are three groups above: 1, 2, 3.) The result should be:
1,2016-11-18 14:47:05
2,2016-10-12 22:24:25
3,2016-10-12 17:24:25
I am also trying to make this efficient (e.g., finishing within just a few minutes on a medium-sized cluster with 100 million records), so the sorting/ranking should be done in the most efficient and correct way.

You can use a window function, as shown below:
scala> val in = Seq((1,"2016-10-12 18:24:25"),
| (1,"2016-11-18 14:47:05"),
| (2,"2016-10-12 21:24:25"),
| (2,"2016-10-12 20:24:25"),
| (2,"2016-10-12 22:24:25"),
| (3,"2016-10-12 17:24:25")).toDF("id", "ts")
in: org.apache.spark.sql.DataFrame = [id: int, ts: string]
scala> import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.expressions.Window
scala> val win = Window.partitionBy("id").orderBy(desc("ts"))
win: org.apache.spark.sql.expressions.WindowSpec = org.apache.spark.sql.expressions.WindowSpec@59fa04f7
scala> in.withColumn("rank", row_number().over(win)).where("rank == 1").show(false)
+---+-------------------+----+
| id| ts|rank|
+---+-------------------+----+
| 1|2016-11-18 14:47:05| 1|
| 3|2016-10-12 17:24:25| 1|
| 2|2016-10-12 22:24:25| 1|
+---+-------------------+----+
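The window pipeline above is equivalent to keeping, for each `id`, the row with the maximum `ts`. A minimal pure-Python sketch of that same logic on the question's sample data (no Spark required, just to make the semantics concrete):

```python
# Sample (id, ts) pairs from the question.
rows = [
    (1, "2016-10-12 18:24:25"),
    (1, "2016-11-18 14:47:05"),
    (2, "2016-10-12 21:24:25"),
    (2, "2016-10-12 20:24:25"),
    (2, "2016-10-12 22:24:25"),
    (3, "2016-10-12 17:24:25"),
]

# Equivalent of partitionBy("id").orderBy(desc("ts")) + row_number() == 1:
# within each id, keep the largest timestamp. Comparing the strings is safe
# here because all timestamps share one zero-padded "yyyy-MM-dd HH:mm:ss" format.
latest = {}
for id_, ts in rows:
    if id_ not in latest or ts > latest[id_]:
        latest[id_] = ts

result = sorted(latest.items())
print(result)
# [(1, '2016-11-18 14:47:05'), (2, '2016-10-12 22:24:25'), (3, '2016-10-12 17:24:25')]
```

Spark does the same thing distributed: rows are shuffled by `id`, sorted within each partition of the window, numbered, and filtered, so only one shuffle is needed even at 100M rows.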
You have to use a window function: partition the window by group and order by time descending, as in the PySpark script below.
from pyspark.sql.functions import *
from pyspark.sql.window import Window

schema = "Group int, time timestamp"
df = spark.read.format('csv').schema(schema).options(header=False).load('/FileStore/tables/Group_window.txt')
w = Window.partitionBy('Group').orderBy(desc('time'))
df = df.withColumn('Rank', dense_rank().over(w))
df.filter(df.Rank == 1).drop(df.Rank).show()
+-----+-------------------+
|Group|               time|
+-----+-------------------+
| 1|2016-11-18 14:47:05|
| 3|2016-10-12 17:24:25|
| 2|2016-10-12 22:24:25|
+-----+-------------------+
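One caveat about this second answer: `dense_rank()` gives tied rows the same rank, so if a group's latest timestamp appears twice, filtering on rank 1 keeps both rows, whereas `row_number()` always keeps exactly one row per group. A small pure-Python illustration of the difference, using hypothetical tied data (not from the question):

```python
# Hypothetical group with a tied latest timestamp (not from the question).
rows = [("a", "2016-10-12 22:24:25"),
        ("a", "2016-10-12 22:24:25"),
        ("a", "2016-10-12 20:24:25")]

ordered = sorted(rows, key=lambda r: r[1], reverse=True)

# row_number(): every row gets a distinct number -> exactly one "latest" row.
row_number_kept = [r for i, r in enumerate(ordered, start=1) if i == 1]

# dense_rank(): ties share a rank -> both tied rows get rank 1 and survive the filter.
distinct_ts = sorted({r[1] for r in rows}, reverse=True)
dense_rank_kept = [r for r in rows if distinct_ts.index(r[1]) + 1 == 1]

print(len(row_number_kept), len(dense_rank_kept))  # 1 2
```

So pick `row_number()` when you want strictly one record per group, and `dense_rank()` (or `rank()`) when ties should all be kept.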
Comments: How would you write this in PySpark? — The link I included gives several examples in PySpark. — Thanks, I was able to port all of the code except the last line, .withColumn("rank", row_number().over(win)).where("rank == 1").show(false). — Does this answer your question?