Apache Spark: error when saving a RelationalGroupedDataset to HDFS
I'm new to Spark and am trying to write grouped data to a text file, but I get the following error:
Error:(55, 31) value write is not a member of org.apache.spark.sql.RelationalGroupedDataset
The code snippet is:
val dfyearlyGamesSelect = dfFiltered.select($"release_year",$"title")
val dfyearlyGroup = dfyearlyGamesSelect.groupBy($"release_year")
val dfWrite = dfyearlyGroup.write
.format("com.databricks.spark.csv")
.option("header","true")
.save(outputPath)
Expected output: for each year, the title of the game with the highest score (columns: release_year, title, score).
Sample data:
,score_phrase,title,url,platform,score,genre,editors_choice,release_year,release_month,release_day
0,Amazing,LittleBigPlanet PS Vita,/games/littlebigplanet-vita/vita-98907,PlayStation Vita,9.0,Platformer,Y,2012,9,12
1,Amazing,LittleBigPlanet PS Vita -- Marvel Super Hero Edition,/games/littlebigplanet-ps-vita-marvel-super-hero-edition/vita-20027059,PlayStation Vita,9.0,Platformer,Y,2012,9,12
2,Great,Splice: Tree of Life,/games/splice/ipad-141070,iPad,8.5,Puzzle,N,2012,9,12
3,Great,NHL 13,/games/nhl-13/xbox-360-128182,Xbox 360,8.5,Sports,N,2012,9,11
4,Great,NHL 13,/games/nhl-13/ps3-128181,PlayStation 3,8.5,Sports,N,2012,9,11
5,Good,Total War Battles: Shogun,/games/total-war-battles-shogun/mac-142565,Macintosh,7.0,Strategy,N,2012,9,11
6,Awful,Double Dragon: Neon,/games/double-dragon-neon/xbox-360-131320,Xbox 360,3.0,Fighting,N,2012,9,11
7,Amazing,Guild Wars 2,/games/guild-wars-2/pc-896298,PC,9.0,RPG,Y,2012,9,11
That's because write is not a member of RelationalGroupedDataset. You must apply an aggregation function first to get a DataFrame back, and only then can you write it to HDFS:
val dfYearlyGroup = dfyearlyGamesSelect.groupBy($"release_year")
.agg( first($"title") as "title" )
Now dfYearlyGroup will be a DataFrame, and you can write it to HDFS. Also, you don't need to store the result of write in a variable, since it doesn't return anything:
dfYearlyGroup.write
.format("com.databricks.spark.csv")
.option("header","true")
.save(outputPath)
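Another common pattern for "top title per year" that avoids window functions entirely is to aggregate the maximum score per year and join it back. This is only a sketch; it assumes dfFiltered contains the release_year, title, and score columns shown in the sample data:

```scala
import org.apache.spark.sql.functions._

// Compute the maximum score for each year...
val dfMaxPerYear = dfFiltered.groupBy($"release_year")
  .agg(max($"score") as "score")

// ...then join back on (release_year, score) to recover the matching titles.
// Like rank, this keeps all titles tied at the maximum score.
val dfTopTitles = dfFiltered
  .join(dfMaxPerYear, Seq("release_year", "score"))
  .select($"release_year", $"title", $"score")
```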
Edit:
For your use case, you can use the window function rank or row_number, depending on whether you want multiple rows when scores are tied:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
df.select($"release_year", $"title", $"score").show(false)
+------------+----------------------------------------------------+-----+
|release_year|title |score|
+------------+----------------------------------------------------+-----+
|2012 |LittleBigPlanet PS Vita |9.0 |
|2012 |LittleBigPlanet PS Vita -- Marvel Super Hero Edition|9.0 |
|2012 |Splice: Tree of Life |8.5 |
|2012 |NHL 13 |8.5 |
|2012 |NHL 13 |8.5 |
|2012 |Total War Battles: Shogun |7.0 |
|2012 |Double Dragon: Neon |3.0 |
|2012 |Guild Wars 2 |9.0 |
+------------+----------------------------------------------------+-----+
val w = Window.partitionBy($"release_year").orderBy($"score".desc)
val dfYearlyMaxScore = df.withColumn("rank", rank().over(w))
  .where($"rank" === lit(1))
.select($"release_year", $"title", $"score")
dfYearlyMaxScore.show(false)
+------------+----------------------------------------------------+-----+
|release_year|title |score|
+------------+----------------------------------------------------+-----+
|2012 |LittleBigPlanet PS Vita |9.0 |
|2012 |LittleBigPlanet PS Vita -- Marvel Super Hero Edition|9.0 |
|2012 |Guild Wars 2 |9.0 |
+------------+----------------------------------------------------+-----+
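To illustrate the row_number alternative mentioned above: if you want exactly one row per year even when scores tie, you can rank by row_number instead. This is a sketch using the same window as before (the variable names df and w are assumed from the snippet above):

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy($"release_year").orderBy($"score".desc)

// row_number assigns a unique, gapless index within each partition,
// so ties are broken arbitrarily and each year keeps exactly one row.
val dfOnePerYear = df.withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .select($"release_year", $"title", $"score")
```

With the sample data this would keep a single 9.0-scored title for 2012, whereas rank keeps all three tied titles.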
Now you can write it out with:
dfYearlyMaxScore.write
.format("com.databricks.spark.csv")
.option("header","true")
.save(outputPath)
Thanks philantrovert for the answer, but there is one problem: this produces an array of strings, and the CSV data source does not support the array data type.