Spark: flattening a nested schema read from DynamoDB JSON


I am working with DynamoDB JSON that looks like this:

{ 
  "name" : {"S" : "John"},
  "birthday": {
    "M" : {
       "month" : {"N": 1},
       "year" : {"N": 2000},
       "day" : {"N": 2} 
    }
  }
}

When I read it in Spark with

val df = spark.read.json("s3://path")

I get a complex schema:

name : StructType (S : String),
birthday : StructType (
  M : StructType (
    month : StructType (N : int),
    year : StructType (N : int),
    day : StructType (N : int)
  )
)

Instead, I would like to change the schema to

name : String
birthday : StructType (
  month : int
  year : int
  day : int
)

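Is there a way to do this?

In practice my schema is much larger than this example, with many deeply nested structs, so I would also like to know whether there is a dynamic way to "normalize" the schema.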

You can pull the inner struct up one level with selectExpr:

.selectExpr("name", "birthday.M as birthday")

Or you could even flatten it completely to the root:

.selectExpr("name", "birthday.M.*")

I was able to do this with the named_struct function:

df.selectExpr("""
named_struct (
  'name', name.S,
  'birthday', named_struct(
    'month', cast(birthday.M.month.N as decimal),
    'year', cast(birthday.M.year.N as decimal),
    'day', cast(birthday.M.day.N as decimal)
  )
) as items
""")

This worked well for me.
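For the "dynamic" part of the question, one possible approach is to walk the inferred schema and build the select expressions recursively instead of writing them by hand. The following is only a rough sketch, not something from the answers above: `DynamoNormalizer`, `normalize` and `scalarTags` are made-up names, and it assumes every attribute was inferred as a single-field struct keyed by `S`, `N`, `BOOL` or `M`. Lists, sets, `NULL` markers and attributes with mixed types across records are left untouched, and `N` values keep whatever type Spark inferred (add a cast if you need decimals).

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper, not part of Spark.
object DynamoNormalizer {
  // Assumption: only these scalar type descriptors need unwrapping.
  private val scalarTags = Set("S", "N", "BOOL")

  // `c` points at one attribute value, e.g. {"S": ...} or {"M": {...}}.
  private def attributeValue(c: Column, dt: DataType): Column = dt match {
    case st: StructType if st.fields.length == 1 =>
      val f = st.fields.head
      (f.name, f.dataType) match {
        case ("M", inner: StructType) =>
          // A map: rebuild it as a struct of unwrapped attributes.
          val m = c.getField("M")
          struct(attributes(inner, name => m.getField(name)): _*)
        case (tag, _) if scalarTags(tag) =>
          // A scalar: drop the type descriptor and keep the value.
          c.getField(tag)
        case _ => c
      }
    case _ => c // anything else is passed through unchanged
  }

  // `st` maps attribute names to attribute values; `field` resolves a name to a Column.
  private def attributes(st: StructType, field: String => Column): Seq[Column] =
    st.fields.toSeq.map(f => attributeValue(field(f.name), f.dataType).as(f.name))

  // Top-level columns are treated as the attributes of one DynamoDB item.
  def normalize(df: DataFrame): DataFrame =
    df.select(attributes(df.schema, name => col(s"`$name`")): _*)
}
```

With the DataFrame from the question this would be called as `DynamoNormalizer.normalize(df).printSchema()`, which should show `name` as a string and `birthday` as a struct of `day`, `month` and `year`. Keeping the "attribute value" and "map of attributes" cases separate is deliberate: it avoids mistaking a real attribute that happens to be called `S` or `M` for a type descriptor.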

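Thanks for the suggestion, Dave. But doesn't that only move the birthday field up a level, without fixing the "N" that appears inside it?

Ah, I didn't see that part of the schema. If there are a lot of fields, you could do something with .schema. If it's just these known fields, you could alias each one.

Spark version???

@Srinivas Spark 2.4.3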