Scala: Not getting the other columns when using Spark SQL groupBy with max?


I have a dataset of movie ratings by year:

+--------------------+----------+----------+
|         movie_title|imdb_score|title_year|
+--------------------+----------+----------+
|             Avatar?|       7.9|      2009|
|Pirates of the Ca...|       7.1|      2007|
|            Spectre?|       6.8|      2015|
|The Dark Knight R...|       8.5|      2012|
|Star Wars: Episod...|       7.1|      null|
|        John Carter?|       6.6|      2012|
|       Spider-Man 3?|       6.2|      2007|
|            Tangled?|       7.8|      2010|
|Avengers: Age of ...|       7.5|      2015|
|Harry Potter and ...|       7.5|      2009|
|Batman v Superman...|       6.9|      2016|
|   Superman Returns?|       6.1|      2006|
|  Quantum of Solace?|       6.7|      2008|
|Pirates of the Ca...|       7.3|      2006|
|    The Lone Ranger?|       6.5|      2013|
|       Man of Steel?|       7.2|      2013|
|The Chronicles of...|       6.6|      2008|
|       The Avengers?|       8.1|      2012|
|Pirates of the Ca...|       6.7|      2011|
|     Men in Black 3?|       6.8|      2012|
|The Hobbit: The B...|       7.5|      2014|
|The Amazing Spide...|       7.0|      2012|
|         Robin Hood?|       6.7|      2010|
|The Hobbit: The D...|       7.9|      2013|
| The Golden Compass?|       6.1|      2007|
|          King Kong?|       7.2|      2005|
|            Titanic?|       7.7|      1997|
|Captain America: ...|       8.2|      2016|
|         Battleship?|       5.9|      2012|
|     Jurassic World?|       7.0|      2015|
|            Skyfall?|       7.8|      2012|
|       Spider-Man 2?|       7.3|      2004|
|         Iron Man 3?|       7.2|      2013|
|Alice in Wonderland?|       6.5|      2010|
|X-Men: The Last S...|       6.8|      2006|
|Monsters University?|       7.3|      2013|
|Transformers: Rev...|       6.0|      2009|
|Transformers: Age...|       5.7|      2014|
|Oz the Great and ...|       6.4|      2013|
|The Amazing Spide...|       6.7|      2014|
|       TRON: Legacy?|       6.8|      2010|

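I need to find the highest-rated movie for each year, based on imdb_score.

I have created the DataFrame and a temporary view using df.createOrReplaceTempView("movie_metadata").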

When I run

spark.sql("select max(imdb_score), title_year from movie_metadata group by title_year")

I get the correct result:

+---------------+----------+
|max(imdb_score)|title_year|
+---------------+----------+
|            8.3|      1959|
|            8.7|      1990|
|            8.7|      1975|
|            8.7|      1977|
|            8.9|      2003|
|            8.4|      2007|
|            9.0|      1974|
|            8.6|      2015|
|            8.3|      1927|
|            8.1|      1955|
|            8.5|      2006|
|            8.2|      1978|
|            8.3|      1925|
|            8.3|      1961|

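This shows the highest score for each year, but I also need the title of the movie that has that score. When I run

spark.sql("select last(movie_title), max(imdb_score), title_year from movie_metadata group by title_year")

with the movie title wrapped in last or first, I do not get the correct movie title for the highest score of that year. Without the first or last function, I get an exception. Please suggest the right approach. Thanks.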

You can use a window function:

df.createOrReplaceTempView("Movies")
// row_number() numbers the movies within each year by descending score; rn = 1 keeps the top one
spark.sql("select title_year, movie_title, imdb_score from (select *, row_number() OVER (PARTITION BY title_year ORDER BY imdb_score DESC) as rn FROM Movies) tmp where rn = 1").show(false)

If you would rather not create a temporary view, do the following:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Order the movies within each year, highest imdb_score first
val window = Window.partitionBy("title_year").orderBy(col("imdb_score").desc)
df.withColumn("rn", row_number().over(window))
  .where(col("rn") === 1)
  .select("title_year", "movie_title", "imdb_score") // selecting only these columns also drops rn
  .show(false)
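
One caveat: row_number() numbers tied rows in no guaranteed order, so if two movies share the top score in a year, only one of them survives the rn = 1 filter. A minimal sketch, assuming you want to keep every tied movie, swaps in rank() instead. The sample rows and the names movies, byYear and topWithTies are made up for illustration; only the column names come from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("top-rated-with-ties").getOrCreate()
import spark.implicits._

// Hypothetical sample rows: two films tie for the best score in 2012.
val movies = Seq(
  ("Film A", 8.5, 2012),
  ("Film B", 8.5, 2012),
  ("Film C", 6.6, 2012),
  ("Film D", 7.9, 2009)
).toDF("movie_title", "imdb_score", "title_year")

val byYear = Window.partitionBy("title_year").orderBy(col("imdb_score").desc)

// rank() gives tied rows the same rank, so every top-scoring film in a year is kept.
val topWithTies = movies
  .withColumn("rnk", rank().over(byYear))
  .where(col("rnk") === 1)
  .drop("rnk")

topWithTies.show(false)
```

With this sample data, both 2012 films with a score of 8.5 are returned, along with the single 2009 film.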

Hope this helps.
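
As an alternative sketch (not taken from the answer above), the same result can be obtained with a plain groupBy by taking the max of a struct whose first field is the score. Spark compares structs field by field from left to right, so the title travels along with the highest score. The sample rows and the names movies and best are assumptions for illustration. On a tie, this variant keeps the title that sorts last alphabetically.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, max, struct}

// Local session for illustration; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("top-rated-struct-max").getOrCreate()
import spark.implicits._

// Hypothetical sample rows using the column names from the question.
val movies = Seq(
  ("Film A", 8.5, 2012),
  ("Film C", 6.6, 2012),
  ("Film D", 7.9, 2009),
  ("Film E", 7.5, 2009)
).toDF("movie_title", "imdb_score", "title_year")

// imdb_score comes first in the struct, so max() picks the row with the highest score
// and movie_title is carried along as the second field.
val best = movies
  .groupBy("title_year")
  .agg(max(struct(col("imdb_score"), col("movie_title"))).as("best"))
  .select(col("title_year"), col("best.movie_title"), col("best.imdb_score"))

best.show(false)
```

This avoids the window and the helper column, at the cost of returning only one movie per year.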