Scala: How do I select a subset of fields from an array column in Spark?
Let's say I have a DataFrame as follows:
case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])
val df = sqlContext.createDataFrame(List(
  MotherClass(Array(
    SubClass("1", 1, "thisIsUseless"),
    SubClass("2", 2, "thisIsUseless"),
    SubClass("3", 3, "thisIsUseless")
  )),
  MotherClass(Array(
    SubClass("4", 4, "thisIsUseless"),
    SubClass("5", 5, "thisIsUseless")
  ))
))
The schema is:
root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)
 |    |    |-- useless: string (nullable = true)
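I'm looking for a way to select only a subset of fields, id and size, from the array column subClasss, while keeping the nested array structure.
The resulting schema would be:
root
 |-- subClasss: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: string (nullable = true)
 |    |    |-- size: integer (nullable = false)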
I've tried:
df.select("subClasss.id","subClasss.size")
but this splits the array subClasss into two arrays:
root
 |-- id: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- size: array (nullable = true)
 |    |-- element: integer (containsNull = true)
Is there a way to keep the original structure and just eliminate the useless field? Something that would look like:
df.select("subClasss.[id,size]")
Thank you for your time.

Spark >= 2.4:
You can use arrays_zip with cast:
import org.apache.spark.sql.functions.arrays_zip
df.select(arrays_zip(
$"subClasss.id", $"subClasss.size"
).cast("array<struct<id:string,size:int>>"))
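An alternative that avoids the zip-and-cast step is the SQL higher-order function `transform`, which is also available from Spark 2.4 and rebuilds each array element in place. The following is a minimal sketch, not a definitive implementation: the object name `TransformSketch`, the helper `keepIdAndSize`, and the local `SparkSession` setup are assumptions made for illustration, and the case classes are repeated only so the block stands alone.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.expr

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

object TransformSketch {
  // Hypothetical helper: rebuilds every element of `subClasss`
  // as a struct holding only `id` and `size`.
  def keepIdAndSize(df: DataFrame): DataFrame =
    df.select(
      expr("transform(subClasss, x -> named_struct('id', x.id, 'size', x.size))")
        .as("subClasss")
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("transform-sketch")
      .getOrCreate()

    val df = spark.createDataFrame(Seq(
      MotherClass(Array(
        SubClass("1", 1, "thisIsUseless"),
        SubClass("2", 2, "thisIsUseless")
      ))
    ))

    // The array column is kept; only the element struct shrinks.
    keepIdAndSize(df).printSchema()
    spark.stop()
  }
}
```

Because the lambda runs per element, the nested array structure is kept and no target type string has to be written out by hand.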
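Spark < 2.4:
You can use a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
case class Record(id: String, size: Int)
// Rebuild each struct element without the `useless` field
val dropUseless = udf((xs: Seq[Row]) => xs.map {
  case Row(id: String, size: Int, _) => Record(id, size)
})
df.select(dropUseless($"subClasss"))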
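On more recent releases the unwanted field can also be dropped by name instead of listing the fields to keep. The sketch below assumes Spark 3.1 or later, where the Scala `transform` function and `Column.dropFields` are both available; the object name `DropFieldsSketch`, the helper `dropUselessField`, and the local session setup are hypothetical, and the case classes are repeated only so the block stands alone.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, transform}

case class SubClass(id: String, size: Int, useless: String)
case class MotherClass(subClasss: Array[SubClass])

object DropFieldsSketch {
  // Hypothetical helper: removes the `useless` field from every
  // struct element of the `subClasss` array (assumes Spark >= 3.1).
  def dropUselessField(df: DataFrame): DataFrame =
    df.select(
      transform(col("subClasss"), element => element.dropFields("useless"))
        .as("subClasss")
    )

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("drop-fields-sketch")
      .getOrCreate()

    val df = spark.createDataFrame(Seq(
      MotherClass(Array(
        SubClass("4", 4, "thisIsUseless"),
        SubClass("5", 5, "thisIsUseless")
      ))
    ))

    dropUselessField(df).printSchema()
    spark.stop()
  }
}
```

Unlike the UDF, this stays within built-in column expressions, so no `Row` pattern match is involved.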