Scala: changing column names of nested data in BigQuery using Spark

Tags: scala, apache-spark, google-bigquery

I am trying to write some data to BigQuery using Spark Scala, and my Spark DataFrame looks like this:

root
 |-- id: string (nullable = true)
 |-- cost: double (nullable = false)
 |-- nodes: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- settled: string (nullable = true)
 |    |    |-- constant: string (nullable = true)
 |-- status: string (nullable = true)
I tried to change the structure of the DataFrame:

import org.apache.spark.sql.types._

val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("cost", DoubleType, true),
  StructField("nodes", StructType(Array(
    StructField("settled", StringType),
    StructField("constant", StringType)))),
  StructField("status", StringType, true)))

val actualDf = spark.createDataFrame(results, schema)
But it does not work. When the data is written to BigQuery, the column names look like this:

id, cost, nodes.list.element.settled, nodes.list.element.constant, status

Is it possible to change these column names to:

id, cost, settled, constant, status

You can explode the nodes array to get a flat column structure, and then write the DataFrame to BigQuery.

For example:

import spark.implicits._
import org.apache.spark.sql.functions.explode

val jsn_ds = Seq("""{"id":1, "cost": "2.0","nodes":[{"settled":"u","constant":"p"}],"status":"s"}""").toDS

spark.read.json(jsn_ds).printSchema
// root
// |-- cost: string (nullable = true)
// |-- id: long (nullable = true)
// |-- nodes: array (nullable = true)
// |    |-- element: struct (containsNull = true)
// |    |    |-- constant: string (nullable = true)
// |    |    |-- settled: string (nullable = true)
// |-- status: string (nullable = true)

spark.read.json(jsn_ds).
      withColumn("expld",explode('nodes)).
      select("*","expld.*").
      drop("expld","nodes").
      show()

//+----+---+------+--------+-------+
//|cost| id|status|constant|settled|
//+----+---+------+--------+-------+
//| 2.0|  1|     s|       p|      u|
//+----+---+------+--------+-------+
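
For the final write to BigQuery, a minimal sketch assuming the spark-bigquery-connector is on the classpath; the staging bucket and the dataset.table name are placeholders:

import org.apache.spark.sql.functions.{col, explode}

// flatDf is the flattened DataFrame, built exactly as in the example above
val flatDf = spark.read.json(jsn_ds)
  .withColumn("expld", explode(col("nodes")))
  .select("*", "expld.*")
  .drop("expld", "nodes")

flatDf.write
  .format("bigquery")
  .option("temporaryGcsBucket", "my-staging-bucket") // placeholder GCS bucket for the indirect write
  .mode("overwrite")
  .save("my_dataset.my_table")                       // placeholder dataset.table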

Thanks for your answer. I am getting the following error: Exception in thread "main" org.apache.spark.sql.AnalysisException: Resolved attribute(s) nodes#106 missing from id#178, cost#179, nodes#180, status#181 in operator !Generate explode(nodes#106), true, false, [expld#188];; Project [id#178, cost#179, nodes#180, status#181, expld#188] +- !Generate explode(nodes#106), true, false, [expld#188] +- LogicalRDD [id#178, cost#179, nodes#180, status#181]. My code: spark.read.json(results.toJSON).withColumn("expld", explode(results.col("nodes"))).select("*", "expld.*").drop("expld", "nodes").show

Try this instead: spark.read.json(results.toJSON).withColumn("expld", explode(col("nodes"))).select("*", "expld.*").drop("expld", "nodes").show
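
A minimal sketch of the fix from that last comment, assuming the original DataFrame is named results as in the comments: round-tripping through toJSON produces a DataFrame with a new logical plan, so the exploded column must be referenced with col("nodes") rather than results.col("nodes"), which points at the old plan and raises the AnalysisException above.

import org.apache.spark.sql.functions.{col, explode}

// "results" is the caller's original DataFrame (name assumed from the comments).
// All column references below go through col(...), not results.col(...), because
// spark.read.json(results.toJSON) produces fresh attribute IDs.
val flattened = spark.read.json(results.toJSON)
  .withColumn("expld", explode(col("nodes")))
  .select("*", "expld.*")
  .drop("expld", "nodes")

flattened.show()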