Apache Spark: error when saving a RelationalGroupedDataset to HDFS


I'm new to Spark and I'm trying to write grouped data to a text file, but I get the following error:

Error:(55, 31) value write is not a member of org.apache.spark.sql.RelationalGroupedDataset
The code snippet is:

val dfyearlyGamesSelect = dfFiltered.select($"release_year",$"title")
val dfyearlyGroup = dfyearlyGamesSelect.groupBy($"release_year")
val dfWrite = dfyearlyGroup.write
                           .format("com.databricks.spark.csv")
                           .option("header","true")
                           .save(outputPath)
Expected output: for each year, the title of the game with the highest score (columns: release_year, title, score).

Sample data:

,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero Edition,/games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11
5,Good,Total War Battles: Shogun,/games/total-war-battles-shogun/mac-142565,Macintosh,7.0,Strategy,N,2012,9,11
6,Awful,Double Dragon: Neon,/games/double-dragon-neon/xbox-360-131320,Xbox 360,3.0,Fighting,N,2012,9,11
7,Amazing,Guild Wars 2,/games/guild-wars-2/pc-896298,PC,9.0,RPG,Y,2012,9,11
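
For reference, a minimal sketch of how such a CSV could be loaded into a DataFrame (the spark session and the file path "ign.csv" are assumptions, not part of the original question):

val df = spark.read
              .format("com.databricks.spark.csv")
              .option("header","true")
              .option("inferSchema","true")
              .load("ign.csv")  // hypothetical input path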

That's because write is not a member of GroupedData (the RelationalGroupedDataset in your error). You have to apply an aggregate function to get a DataFrame back before you can write it to HDFS:

import org.apache.spark.sql.functions.first

// any aggregation turns the grouped data back into a DataFrame
val dfYearlyGroup = dfyearlyGamesSelect.groupBy($"release_year")
                                       .agg(first($"title") as "title")
Now dfYearlyGroup will be a DataFrame that you can write to HDFS. Also, you don't have to store the write call in a variable, because it doesn't return anything:

dfYearlyGroup.write
             .format("com.databricks.spark.csv")
             .option("header","true")
             .save(outputPath)
Edit: For your use case, you can use the window function rank or row_number, depending on whether you want multiple rows when scores tie.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{rank, lit}

df.select($"release_year", $"title", $"score").show(false)
+------------+----------------------------------------------------+-----+
|release_year|title                                               |score|
+------------+----------------------------------------------------+-----+
|2012        |LittleBigPlanet PS Vita                             |9.0  |
|2012        |LittleBigPlanet PS Vita -- Marvel Super Hero Edition|9.0  |
|2012        |Splice: Tree of Life                                |8.5  |
|2012        |NHL 13                                              |8.5  |
|2012        |NHL 13                                              |8.5  |
|2012        |Total War Battles: Shogun                           |7.0  |
|2012        |Double Dragon: Neon                                 |3.0  |
|2012        |Guild Wars 2                                        |9.0  |
+------------+----------------------------------------------------+-----+


val w = Window.partitionBy($"release_year").orderBy($"score".desc)

val dfYearlyMaxScore = df.withColumn("rank", rank() over w)
                         .where($"rank" === lit(1))
                         .select($"release_year", $"title", $"score")

dfYearlyMaxScore.show(false)

+------------+----------------------------------------------------+-----+
|release_year|title                                               |score|
+------------+----------------------------------------------------+-----+
|2012        |LittleBigPlanet PS Vita                             |9.0  |
|2012        |LittleBigPlanet PS Vita -- Marvel Super Hero Edition|9.0  |
|2012        |Guild Wars 2                                        |9.0  |
+------------+----------------------------------------------------+-----+
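
If you want exactly one row per year even when several titles tie for the top score, a minimal sketch using row_number instead of rank (reusing the same window w; dfYearlyTopOne and the rn column are just illustrative names) would be:

import org.apache.spark.sql.functions.row_number

// row_number assigns consecutive numbers with no ties,
// so each release_year keeps exactly one (arbitrary) top-scoring row
val dfYearlyTopOne = df.withColumn("rn", row_number() over w)
                       .where($"rn" === 1)
                       .select($"release_year", $"title", $"score")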
Now you can write it with:

dfYearlyMaxScore.write
                .format("com.databricks.spark.csv")
                .option("header","true")
                .save(outputPath)

Thanks philantrovert for the answer, but there is one problem: this results in an array of strings, and the CSV data source does not support the array data type.
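
If that array comes from aggregating with collect_list, one common workaround (a sketch; dfYearlyTitles and the "|" separator are assumptions) is to flatten the array into a single delimited string with concat_ws before writing the CSV:

import org.apache.spark.sql.functions.{collect_list, concat_ws}

// concat_ws joins the array<string> into one plain string column,
// which the CSV data source can handle
val dfYearlyTitles = dfyearlyGamesSelect.groupBy($"release_year")
                                        .agg(concat_ws("|", collect_list($"title")) as "titles")

dfYearlyTitles.write
              .format("com.databricks.spark.csv")
              .option("header","true")
              .save(outputPath)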