Scala: create a Spark DataFrame from a nested array of struct elements?
I have read a JSON file into Spark. The file has the following structure:
root
|-- engagement: struct (nullable = true)
| |-- engagementItems: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- availabilityEngagement: struct (nullable = true)
| | | | |-- dimapraUnit: struct (nullable = true)
| | | | | |-- code: string (nullable = true)
| | | | | |-- constrained: boolean (nullable = true)
| | | | | |-- id: long (nullable = true)
| | | | | |-- label: string (nullable = true)
| | | | | |-- ranking: long (nullable = true)
| | | | | |-- type: string (nullable = true)
| | | | | |-- version: long (nullable = true)
| | | | | |-- visible: boolean (nullable = true)
I created a recursive function to flatten the schema into one column per leaf of each nested StructType:
def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case _ => Array(col(colName).alias(colName))
    }
  })
}
val newDF = SIWINSDF.select(flattenSchema(SIWINSDF.schema): _*)
val secondDF = newDF.toDF(newDF.columns.map(_.replace(".", "_")): _*)
How can I flatten an ArrayType that contains a nested StructType, for example engagementItems: array (nullable = true)? Any help is much appreciated.
The problem here is that you need to handle the case of ArrayType and convert its element type into a StructType. You can use a Scala runtime cast to achieve this.

First, I generated the following scenario (by the way, including this in your question would have been very helpful, since it makes the problem much easier to reproduce):
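The snippet that builds this test schema did not survive in the post; a plausible reconstruction of a getSchema() helper, with field names and nullability taken from the tree printed below (everything else is an assumption), could look like:

```scala
import org.apache.spark.sql.types._

// Hypothetical reconstruction of the answer's test schema, built
// bottom-up from the innermost struct to the top-level column.
def getSchema(): StructType = {
  val dimapraUnit = StructType(Seq(
    StructField("code", StringType, nullable = true),
    StructField("constrained", BooleanType, nullable = false),
    StructField("id", LongType, nullable = false),
    StructField("label", StringType, nullable = true),
    StructField("ranking", LongType, nullable = false),
    StructField("_type", StringType, nullable = true),
    StructField("version", LongType, nullable = false),
    StructField("visible", BooleanType, nullable = false)
  ))
  val availabilityEngagement = StructType(Seq(
    StructField("dimapraUnit", dimapraUnit, nullable = true)
  ))
  val engagementItem = StructType(Seq(
    StructField("availabilityEngagement", availabilityEngagement, nullable = true)
  ))
  val engagement = StructType(Seq(
    StructField("engagementItems",
      ArrayType(engagementItem, containsNull = true), nullable = true)
  ))
  StructType(Seq(StructField("engagement", engagement, nullable = true)))
}

// Print the schema tree without needing a DataFrame.
getSchema().printTreeString()
```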
This prints out:
root
|-- engagement: struct (nullable = true)
| |-- engagementItems: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- availabilityEngagement: struct (nullable = true)
| | | | |-- dimapraUnit: struct (nullable = true)
| | | | | |-- code: string (nullable = true)
| | | | | |-- constrained: boolean (nullable = false)
| | | | | |-- id: long (nullable = false)
| | | | | |-- label: string (nullable = true)
| | | | | |-- ranking: long (nullable = false)
| | | | | |-- _type: string (nullable = true)
| | | | | |-- version: long (nullable = false)
| | | | | |-- visible: boolean (nullable = false)
Then I modified your function, adding an extra check for ArrayType that converts its element type into a StructType with asInstanceOf:
import org.apache.spark.sql.Column
import org.apache.spark.sql.types._

def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
  schema.fields.flatMap(f => {
    val colName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      case st: StructType => flattenSchema(st, colName)
      case at: ArrayType =>
        // Cast the array's element type to StructType and recurse into it.
        val st = at.elementType.asInstanceOf[StructType]
        flattenSchema(st, colName)
      case _ => Array(new Column(colName).alias(colName))
    }
  })
}
And finally the result:
val s = getSchema()
val res = flattenSchema(s)
res.foreach(println(_))
Output:
engagement.engagementItems.availabilityEngagement.dimapraUnit.code AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.code`
engagement.engagementItems.availabilityEngagement.dimapraUnit.constrained AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.constrained`
engagement.engagementItems.availabilityEngagement.dimapraUnit.id AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.id`
engagement.engagementItems.availabilityEngagement.dimapraUnit.label AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.label`
engagement.engagementItems.availabilityEngagement.dimapraUnit.ranking AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.ranking`
engagement.engagementItems.availabilityEngagement.dimapraUnit._type AS `engagement.engagementItems.availabilityEngagement.dimapraUnit._type`
engagement.engagementItems.availabilityEngagement.dimapraUnit.version AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.version`
engagement.engagementItems.availabilityEngagement.dimapraUnit.visible AS `engagement.engagementItems.availabilityEngagement.dimapraUnit.visible`
Does the array have a fixed length? If not, what you are trying to do will be complicated... To help us help you, could you provide some sample input and the expected output? You could also reduce the problem to the smallest schema that still reproduces it.

If it is an ArrayType, you should perform an explode operation on the DataFrame.

I would prefer generic code that uses the explode function on the DataFrame. How can I get the names of all the arrays?

Hi @J-kram, the answer is based on the question: "How can I flatten an ArrayType that contains a nested StructType, e.g. engagementItems: array (nullable = true)?" Your version stops the process at the inner elements of engagementItems and therefore returns the data of availabilityEngagement. So, in order to reach the last level, dimapraUnit, I changed the function so that it can handle this case as well. Therefore if you run val newDF = SIWINSDF.select(flattenSchema(SIWINSDF.schema): _*), the dimapraUnit struct gets flattened. Isn't that your question?

Hi Alexandros Biratsis, yes, that is correct. There is also a simpler explode function on the DataFrame: val providersDF = SIWINSDF.select(explode(col("engagementItems")).as("collection")).select(col("collection.*")). In order to explode all the arrays, I would like to get exactly the names of all the arrays.

@J-kram, so does my post answer your question :)? I am not sure I understood.

Nice @Alexandros Biratsis, my question now is how to get the names of all the arrays in the DataFrame, such as engagementItems?
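For the follow-up question about finding all the array columns, one possible sketch (not from the original thread) is a recursive walk over the schema, mirroring flattenSchema, that collects the dotted path of every ArrayType field:

```scala
import org.apache.spark.sql.types._

// Sketch: collect the dotted path of every ArrayType field in a schema,
// descending both into structs and into arrays whose elements are structs.
def arrayFieldNames(schema: StructType, prefix: String = null): Array[String] =
  schema.fields.flatMap { f =>
    val colName = if (prefix == null) f.name else prefix + "." + f.name
    f.dataType match {
      case st: StructType => arrayFieldNames(st, colName)
      case ArrayType(st: StructType, _) =>
        // Record this array, then keep looking for arrays nested inside it.
        colName +: arrayFieldNames(st, colName)
      case _: ArrayType => Array(colName)
      case _ => Array.empty[String]
    }
  }
```

For the schema discussed above, arrayFieldNames(SIWINSDF.schema) would yield Array("engagement.engagementItems"), and each returned path is a candidate for an explode.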