Scala: changing the column names of nested data in BigQuery using Spark
I am trying to write some data to BigQuery using Spark Scala. My Spark df looks like:
root
|-- id: string (nullable = true)
|-- cost: double (nullable = false)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- settled: string (nullable = true)
| | |-- constant: string (nullable = true)
|-- status: string (nullable = true)
I tried to change the structure of the DataFrame:
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("cost", DoubleType, true),
  StructField("nodes", StructType(Array(StructField("settled", StringType), StructField("constant", StringType)))),
  StructField("status", StringType, true)))
val actualDf = spark.createDataFrame(results, schema)
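One likely problem with the schema above: in the printed df schema, nodes is an array of structs, but the StructType built here declares it as a plain struct, so it will not match the row data. A sketch of a corrected schema (field nullability is assumed from the printed schema):

```scala
import org.apache.spark.sql.types._

// Sketch: wrap the inner StructType in an ArrayType so that "nodes"
// matches "array<struct<settled:string,constant:string>>" from the df.
val fixedSchema = StructType(Array(
  StructField("id", StringType, true),
  StructField("cost", DoubleType, false),
  StructField("nodes",
    ArrayType(
      StructType(Array(
        StructField("settled", StringType, true),
        StructField("constant", StringType, true))),
      containsNull = true),
    true),
  StructField("status", StringType, true)))
```

Note, however, that renaming fields in the schema does not flatten them; the nested path names still reach BigQuery, which is why the answer below explodes the array instead.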
But it does not work. When written to BigQuery, the column names look like this:
id, cost, nodes.list.element.settled, nodes.list.element.constant, status
Is it possible to change these column names to
id, cost, settled, constant, status?

You can explode the nodes array to get a flat column structure, then write the dataframe to BigQuery.
For example:
import spark.implicits._                       // for .toDS and the ' column syntax
import org.apache.spark.sql.functions.explode  // for explode

val jsn_ds = Seq("""{"id":1, "cost": "2.0","nodes":[{"settled":"u","constant":"p"}],"status":"s"}""").toDS
spark.read.json(jsn_ds).printSchema
// root
// |-- cost: string (nullable = true)
// |-- id: long (nullable = true)
// |-- nodes: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- constant: string (nullable = true)
// | | |-- settled: string (nullable = true)
// |-- status: string (nullable = true)
spark.read.json(jsn_ds).
withColumn("expld",explode('nodes)).
select("*","expld.*").
drop("expld","nodes").
show()
//+----+---+------+--------+-------+
//|cost| id|status|constant|settled|
//+----+---+------+--------+-------+
//| 2.0| 1| s| p| u|
//+----+---+------+--------+-------+
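Once flattened, the DataFrame can be written out with the spark-bigquery connector. A minimal sketch, assuming the connector is on the classpath; flatDf stands for the exploded DataFrame above, and the table and bucket names are placeholders:

```scala
// Sketch: write the flattened DataFrame to BigQuery.
// "my_dataset.my_table" and "my-temp-bucket" are placeholder names.
flatDf.write
  .format("bigquery")
  .option("table", "my_dataset.my_table")
  .option("temporaryGcsBucket", "my-temp-bucket")
  .mode("overwrite")
  .save()
```

With the flat columns id, cost, settled, constant, status, BigQuery receives simple top-level field names instead of the nodes.list.element.* paths.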
Thanks for the answer. I got the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) nodes#106 missing from id#178,cost#179,nodes#180,status#181 in operator !Generate explode(nodes#106), true, false, [expld#188];;
Project [id#178, cost#179, nodes#180, status#181, expld#188]
+- !Generate explode(nodes#106), true, false, [expld#188]
   +- LogicalRDD [id#178, cost#179, nodes#180, status#181]

My code: spark.read.json(results.toJSON).withColumn("expld", explode(results.col("nodes"))).select("*","expld.*").drop("expld","nodes").show

Try this instead, referencing the column by name rather than through the original results DataFrame: spark.read.json(results.toJSON).withColumn("expld", explode(col("nodes"))).select("*","expld.*").drop("expld","nodes").show