Apache Spark: how do I change the schema dynamically?


My DataFrame schema looks like this:

root
 |-- value: struct (nullable = true)
 |    |-- before: struct (nullable = true)
 |    |    |-- id: long (nullable = false)
 |    |    |-- name: string (nullable = false)
 |    |    |-- n number of fields
 |    |-- after: struct (nullable = true)
 |    |    |-- id: long (nullable = false)
 |    |    |-- name: string (nullable = false)
 |    |-- op: string (nullable = false)
 |    |-- ts_ms: long (nullable = true)

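before and after contain the same set of fields with the same names, and those fields are dynamic. I would like the schema to look like this: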

root
 |-- after_id: long (nullable = false)
 |-- after_name: string (nullable = false)
 |-- before_id: long (nullable = false)
 |-- before_name: string (nullable = false)
 |-- op: string (nullable = false)

I am looking for a way to flatten the nested struct while avoiding duplicate field names.

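I think the question is more about developing an algorithm for a recursive data structure than about Spark itself. I don't have a ready-made solution, but I can give you some tools that may help a bit.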

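A schema is a StructType, which can have zero, one, or more fields, and any of those fields can itself be a StructType.

All the available types are in the org.apache.spark.sql.types package.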

Let's (re)create the schema:

import org.apache.spark.sql.types._
val schema = new StructType().add(
  StructField("value", new StructType().add(
    StructField("before", new StructType().add(
      StructField("id", LongType))))))
scala> println(schema.treeString)
root
 |-- value: struct (nullable = true)
 |    |-- before: struct (nullable = true)
 |    |    |-- id: long (nullable = true)

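You should be able to build more complex schemas easily.

schema.fields gives you access to the StructFields at the current level. Copy them over unchanged until you hit another StructType, which you then process recursively, and so on. You get the idea.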
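The recursive walk described above could be sketched roughly as follows. This is only an illustration, not a ready-made solution: leafPaths and FlattenSketch are hypothetical names (not part of Spark), the schema is rebuilt by hand from the question with only id and name standing in for the dynamic fields, and it assumes no field name contains a dot.

```scala
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object FlattenSketch {
  // Hypothetical helper (not a Spark API): walks schema.fields recursively and
  // returns one (dotted column path, underscore-joined alias) pair per leaf field.
  // Assumption: field names contain no dots, so no backtick quoting is needed.
  def leafPaths(schema: StructType,
                path: Seq[String] = Seq.empty,
                alias: Seq[String] = Seq.empty): Seq[(String, String)] =
    schema.fields.toSeq.flatMap { field =>
      field.dataType match {
        // Another StructType: descend, extending both the path and the alias.
        case nested: StructType =>
          leafPaths(nested, path :+ field.name, alias :+ field.name)
        // Any other type is a leaf: emit its path and its flattened name.
        case _ =>
          Seq(((path :+ field.name).mkString("."), (alias :+ field.name).mkString("_")))
      }
    }

  // The schema from the question, with id and name standing in for the n fields.
  val row: StructType = StructType(Seq(
    StructField("id", LongType, nullable = false),
    StructField("name", StringType, nullable = false)))
  val value: StructType = StructType(Seq(
    StructField("before", row),
    StructField("after", row),
    StructField("op", StringType, nullable = false),
    StructField("ts_ms", LongType)))

  // Start below "value" so that the aliases do not carry a "value_" prefix.
  val pairs: Seq[(String, String)] = leafPaths(value, path = Seq("value"))
  // pairs: (value.before.id, before_id), (value.before.name, before_name),
  //        (value.after.id, after_id), (value.after.name, after_name),
  //        (value.op, op), (value.ts_ms, ts_ms)
}
```

Given a DataFrame df with that schema, the pairs could then be turned into columns with col from org.apache.spark.sql.functions, for example df.select(FlattenSketch.pairs.map { case (p, a) => col(p).as(a) }: _*). The output follows the schema's field order and still includes ts_ms, so the pairs would need filtering or reordering to match the target schema in the question exactly.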

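Something I only discovered today that might help is a typed select, which does the flattening out of the box. It needs a case class for every level, which may not fit the "dynamic" requirement very well, but it is worth considering.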

Let's (re)create the schema using case classes:

case class Before(id: Long)
case class After(id: Long)
case class Value(before: Before, after: After)
case class Data(value: Value)

With these case classes, you can simply do the following:

import org.apache.spark.sql.Encoders
val schema = Encoders.product[Data].schema
scala> println(schema.treeString)
root
 |-- value: struct (nullable = true)
 |    |-- before: struct (nullable = true)
 |    |    |-- id: long (nullable = false)
 |    |-- after: struct (nullable = true)
 |    |    |-- id: long (nullable = false)

That has the same shape as the schema above.

With that in place, you get the flattening for free:

val vs = Seq(Data((Value(Before(0), After(1))))).toDF
scala> vs.select($"value".as[Value]).printSchema
root
 |-- before: struct (nullable = true)
 |    |-- id: long (nullable = false)
 |-- after: struct (nullable = true)
 |    |-- id: long (nullable = false)

With enough selects, you should be fine.
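For the two-level case classes used here, such a chain of selects might look like the sketch below. Note that it swaps the typed select for a star expansion ("value.*"), which unpacks one struct level without needing a case class, and then aliases the leaves by hand. SelectSketch and flattenIds are hypothetical names, and the local SparkSession is only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

case class Before(id: Long)
case class After(id: Long)
case class Value(before: Before, after: After)
case class Data(value: Value)

object SelectSketch {
  // Hypothetical helper: two selects, one per nesting level.
  def flattenIds(df: DataFrame): DataFrame =
    df.select("value.*")                      // unpack value into before and after
      .select(
        col("before.id").as("before_id"),     // prefix each leaf with its parent's name
        col("after.id").as("after_id"))       // so the two id fields do not collide

  def main(args: Array[String]): Unit = {
    // Assumption: a throwaway local session; in spark-shell use the existing `spark`.
    val spark = SparkSession.builder().master("local[1]").appName("select-sketch").getOrCreate()
    import spark.implicits._

    val vs = Seq(Data(Value(Before(0), After(1)))).toDF()
    flattenIds(vs).printSchema()
    spark.stop()
  }
}
```

The aliases are written out by hand here, so this only suits a fixed, known set of fields. For a dynamic set of fields the column list would have to be derived from the schema instead.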