Scala: How do I get this output in Spark SQL?

How can I use spark.sql to get output that lists all the movies for each year?

Output:
(1988,{(Rain Man),(Die Hard)})
(1990,{(The Godfather: Part III),(Die Hard 2),(The Silence of the Lambs),(King of New York)})
(1992,{(Unforgiven),(Bad Lieutenant),(Reservoir Dogs)})
(1994,{(Pulp Fiction)})

Here is the JSON data:

{ "id": "movie:1", "title": "Vertigo", "year": 1958, "genre": "Drama", "summary": "A retired San Francisco detective suffering from acrophobia investigates the strange activities of an old friend's wife, all the while becoming dangerously obsessed with her.", "country": "USA", "director": { "id": "artist:3", "last_name": "Hitchcock", "first_name": "Alfred", "year_of_birth": "1899" }, "actors": [ { "id": "artist:15", "role": "John Ferguson" }, { "id": "artist:16", "role": "Madeleine Elster" } ] }

Here is the code I have tried:

val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc)
val movies = hiveCtx.jsonFile("movies.json")
movies.createOrReplaceTempView("movies")
val ty = hiveCtx.sql("select year, title from movies")

Please help me find the correct query.

Thanks for your help.

You can get a similar result without using spark.sql. Just apply the following to the DataFrame itself:

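import org.apache.spark.sql.functions.{collect_list, concat_ws} // already in scope in spark-shell
movies.groupBy($"year").agg(concat_ws("; ", collect_list($"title"))).show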

Dataset used:

{ "id": "movie:1", "title": "Vertigo", "year": 1958, "genre": "Drama", "summary": "A retired San Francisco detective suffering from acrophobia investigates the strange activities of an old friend's wife, all the while becoming dangerously obsessed with her.", "country": "USA", "director": { "id": "artist:3", "last_name": "Hitchcock", "first_name": "Alfred", "year_of_birth": "1899" }, "actors": [ { "id": "artist:15", "role": "John Ferguson" }, { "id": "artist:16", "role": "Madeleine Elster" } ] }
{ "id": "movie:2", "title": "The Blob", "year": 1958, "genre": "Drama", "summary": "The Blob", "country": "USA", "director": { "id": "artist:3", "last_name": "Hitchcock", "first_name": "Alfred", "year_of_birth": "1899" }, "actors": [ { "id": "artist:15", "role": "John Ferguson" }, { "id": "artist:16", "role": "Madeleine Elster" } ] }

Output:

+----+----------------------------------+
|year|concat_ws(; , collect_list(title))|
+----+----------------------------------+
|1958|                 Vertigo; The Blob|
+----+----------------------------------+
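Since the question asks specifically for a spark.sql query, the same grouping can also be written in SQL. The following is a minimal sketch, not a definitive implementation. It assumes Spark 2.x or later (so `SparkSession` is available), a `movies.json` file with one JSON object per line as in the dataset above, and a `year` column that Spark infers as a long. The object name `MoviesByYear` and the helper `formatRow` are hypothetical names introduced only for this illustration.

```scala
import org.apache.spark.sql.SparkSession

object MoviesByYear {
  // Hypothetical helper: renders one (year, titles) pair in the
  // "(year,{(title),(title)})" layout shown in the question.
  def formatRow(year: Long, titles: Seq[String]): String =
    titles.map(t => s"($t)").mkString(s"($year,{", ",", "})")

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, for experimentation only.
    val spark = SparkSession.builder()
      .appName("MoviesByYear")
      .master("local[*]")
      .getOrCreate()

    // spark.read.json expects one JSON object per line by default.
    val movies = spark.read.json("movies.json")
    movies.createOrReplaceTempView("movies")

    // collect_list gathers every title of a year into one array column.
    val byYear = spark.sql(
      """SELECT year, collect_list(title) AS titles
        |FROM movies
        |GROUP BY year
        |ORDER BY year""".stripMargin)

    // Bring the (small) result to the driver and print it in the requested layout.
    // Assumption: year was inferred as a long and is never null.
    byYear.collect().foreach { row =>
      println(formatRow(row.getLong(0), row.getSeq[String](1)))
    }

    spark.stop()
  }
}
```

Note that `collect_list` does not guarantee the order of titles within a year, so the order inside the braces may differ from run to run.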

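How are you storing this data? Can you include all the code you used to get to this point?

Creating the hiveCtx: val hiveCtx = new org.apache.spark.sql.hive.HiveContext(sc) val movies = hiveCtx.jsonFile("movies.json") movies.createOrReplaceTempView("movies") Now I need a SQL query that produces output listing all the movies for each year: val ty = hiveCtx.sql("select year, title from movies")?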