Creating custom output in an array-of-structs type from a query result using Spark and Scala
I am trying to read a DataFrame and create the structure described in the output format below.

InputDF-
id  genre  netid  seconds  team
1   A      y      0        T1
1   B      l      781623   M1 - S1
1   B      l      623281   N1 - E1
Expected output format-
[id: string (nullable = true),
genre: array ( struct (name: string, seconds: long),..),
netid: array ( struct (name: string, seconds: long),..),
team: array ( struct (name: string, seconds: long, netid: string, genre: string)]
Output with values-
[id: 1,
genre: array ( struct (name: A, seconds: 0),struct (name: B, seconds: 781623),struct (name: B, seconds: 623281))
netid: array ( struct (name: y, viewedMilliseconds: 0),struct (name: l, viewedMilliseconds: 781623),struct (name: y, viewedMilliseconds: 623281))
team: array ( struct (name: A, seconds: 0, netid: y, genre: string, team:T1),struct (name: B, seconds: 781623, netid: l, genre: B, team:M1 - S1),struct (name: B, seconds: 623281, netid: l, genre: B, team:N1 - E1)]
I am unable to create it. Can anyone help me?

I think your expected result is inconsistent, but this should be the general approach:
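import org.apache.spark.sql.functions._
import spark.implicits._

// Group all rows of one id and collect each row into a struct inside an array.
df
  .groupBy($"id")
  .agg(
    // genre: array of struct(name, seconds), with name taken from the genre column
    collect_list(struct($"genre".as("name"), $"seconds")).as("genre"),
    // netid: array of struct(name, seconds), with name taken from the netid column
    collect_list(struct($"netid".as("name"), $"seconds")).as("netid"),
    // team: array of struct(name, seconds, netid, team)
    collect_list(struct($"genre".as("name"), $"seconds", $"netid", $"team")).as("team")
  )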
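
To try the approach end to end, here is a minimal self-contained sketch. It assumes a local SparkSession and rebuilds InputDF from the three rows in the question; the object name `NestedOutputSketch` and the helpers `sample` and `nest` are made up for illustration and are not part of the question or of Spark.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, collect_list, struct}

object NestedOutputSketch {

  // Hypothetical helper: rebuilds the InputDF shown in the question.
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq(
      ("1", "A", "y", 0L, "T1"),
      ("1", "B", "l", 781623L, "M1 - S1"),
      ("1", "B", "l", 623281L, "N1 - E1")
    ).toDF("id", "genre", "netid", "seconds", "team")
  }

  // Hypothetical helper: one output row per id, each column an array of structs.
  def nest(df: DataFrame): DataFrame =
    df.groupBy(col("id"))
      .agg(
        collect_list(struct(col("genre").as("name"), col("seconds"))).as("genre"),
        collect_list(struct(col("netid").as("name"), col("seconds"))).as("netid"),
        collect_list(
          struct(col("genre").as("name"), col("seconds"), col("netid"), col("team"))
        ).as("team")
      )

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for experimenting.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nested-output-sketch")
      .getOrCreate()

    val result = nest(sample(spark))
    result.printSchema() // id plus three array<struct<...>> columns
    result.show(false)   // print full cell contents without truncation

    spark.stop()
  }
}
```

Note that `collect_list` does not guarantee the order of elements after a shuffle, so the structs inside each array may not follow the input row order. If the order matters, one option is to sort each array afterwards, for example with `sort_array`, which orders structs by their first field.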