Arrays 展平Spark数据帧子架构
给定一个嵌套的模式,我想展平一个子结构,同时保持其他模式结构。在本例中,我希望展平product.spec中包含的列,以便最终得到Arrays 展平Spark数据帧子架构,arrays,scala,apache-spark,Arrays,Scala,Apache Spark,给定一个嵌套的模式,我想展平一个子结构,同时保持其他模式结构。在本例中,我希望展平product.spec中包含的列,以便最终得到product.sid&product.desc import org.apache.spark.sql.functions._ import spark.implicits._ case class Spec(sid: Int, desc: String) case class Prod(pid: String, spec: Spec) val df = S
product.sid
&product.desc
import org.apache.spark.sql.functions._
import spark.implicits._
case class Spec(sid: Int, desc: String)
case class Prod(pid: String, spec: Spec)
val df = Seq(
(101, "jenn", Seq(1, 2), Seq(Spec(1, "A"), Spec(2, "B")), Prod("X11", Spec(11, "X")), 1100.0),
(202, "mike", Seq(3), Seq(Spec(3, "C")), Prod("Y22", Spec(22, "Y")), 2200.0)
).toDF("uid", "user", "ids", "specs", "product", "amount")
df.printSchema
>>>
root
|-- uid: integer (nullable = false)
|-- user: string (nullable = true)
|-- ids: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- specs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sid: integer (nullable = false)
| | |-- desc: string (nullable = true)
|-- product: struct (nullable = true)
| |-- pid: string (nullable = true)
| |-- spec: struct (nullable = true)
| | |-- sid: integer (nullable = false)
| | |-- desc: string (nullable = true)
|-- amount: double (nullable = false)
所需的新模式:
root
|-- uid: integer (nullable = false)
|-- user: string (nullable = true)
|-- ids: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- specs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sid: integer (nullable = false)
| | |-- desc: string (nullable = true)
|-- product: struct (nullable = true)
| |-- pid: string (nullable = true)
| |-- spec: struct (nullable = true)
| |-- sid: integer (nullable = false)
| |-- desc: string (nullable = true)
|-- amount: double (nullable = false)
遗憾的是,默认的SQL接口不允许轻松地完成这项工作,因为它使用列名
//incorrect result
df.withColumn("product.sid", col("product.spec.sid")).withColumn("product.desc", col("product.spec.desc")).drop("product.spec.desc").drop("product.spec.sid").printSchema
>>
root
|-- uid: integer (nullable = false)
|-- user: string (nullable = true)
|-- ids: array (nullable = true)
| |-- element: integer (containsNull = false)
|-- specs: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- sid: integer (nullable = false)
| | |-- desc: string (nullable = true)
|-- product: struct (nullable = true)
| |-- pid: string (nullable = true)
| |-- spec: struct (nullable = true)
| | |-- sid: integer (nullable = false)
| | |-- desc: string (nullable = true)
|-- amount: double (nullable = false)
|-- product.sid: integer (nullable = true)
|-- product.desc: string (nullable = true)
您可以重建
产品
结构列:
val df2 = df.withColumn("product", struct("product.pid", "product.spec.sid", "product.spec.desc"))
df2.show
+---+----+------+----------------+------------+------+
|uid|user| ids| specs| product|amount|
+---+----+------+----------------+------------+------+
|101|jenn|[1, 2]|[[1, A], [2, B]]|[X11, 11, X]|1100.0|
|202|mike| [3]| [[3, C]]|[Y22, 22, Y]|2200.0|
+---+----+------+----------------+------------+------+