Scala: changing the column names of nested data in BigQuery using Spark
I am trying to write some data to BigQuery using Spark Scala. My Spark df looks like:
root
|-- id: string (nullable = true)
|-- cost: double (nullable = false)
|-- nodes: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- settled: string (nullable = true)
| | |-- constant: string (nullable = true)
|-- status: string (nullable = true)
I tried to change the structure of the DataFrame:
val schema = StructType(Array(
  StructField("id", StringType, true),
  StructField("cost", DoubleType, true),
  StructField("nodes", StructType(Array(StructField("settled", StringType), StructField("constant", StringType)))),
  StructField("status", StringType, true)))
val actualDf = spark.createDataFrame(results, schema)
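One likely problem with the schema above: in the printed df schema, nodes is an array of structs, but the StructType built here declares it as a plain struct, so it will not match the row data. A sketch of a corrected schema (field nullability is assumed from the printed schema):

```scala
import org.apache.spark.sql.types._

// Sketch: wrap the inner StructType in an ArrayType so that "nodes"
// matches "array<struct<settled:string,constant:string>>" from the df.
val fixedSchema = StructType(Array(
  StructField("id", StringType, true),
  StructField("cost", DoubleType, false),
  StructField("nodes",
    ArrayType(
      StructType(Array(
        StructField("settled", StringType, true),
        StructField("constant", StringType, true))),
      containsNull = true),
    true),
  StructField("status", StringType, true)))
```

Note, however, that renaming fields in the schema does not flatten them; the nested path names still reach BigQuery, which is why the answer below explodes the array instead.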
But it does not work. When written to BigQuery, the column names look like this:
id, cost, nodes.list.element.settled, nodes.list.element.constant, status
Is it possible to change these column names to
id, cost, settled, constant, status?

You can explode the nodes array to get a flat column structure, then write the dataframe to BigQuery.
For example:
import spark.implicits._                       // for .toDS and the ' column syntax
import org.apache.spark.sql.functions.explode  // for explode

val jsn_ds = Seq("""{"id":1, "cost": "2.0","nodes":[{"settled":"u","constant":"p"}],"status":"s"}""").toDS
spark.read.json(jsn_ds).printSchema
// root
// |-- cost: string (nullable = true)
// |-- id: long (nullable = true)
// |-- nodes: array (nullable = true)
// | |-- element: struct (containsNull = true)
// | | |-- constant: string (nullable = true)
// | | |-- settled: string (nullable = true)
// |-- status: string (nullable = true)
spark.read.json(jsn_ds).
withColumn("expld",explode('nodes)).
select("*","expld.*").
drop("expld","nodes").
show()
//+----+---+------+--------+-------+
//|cost| id|status|constant|settled|
//+----+---+------+--------+-------+
//| 2.0| 1| s| p| u|
//+----+---+------+--------+-------+
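Once flattened, the DataFrame can be written out with the spark-bigquery connector. A minimal sketch, assuming the connector is on the classpath; flatDf stands for the exploded DataFrame above, and the table and bucket names are placeholders:

```scala
// Sketch: write the flattened DataFrame to BigQuery.
// "my_dataset.my_table" and "my-temp-bucket" are placeholder names.
flatDf.write
  .format("bigquery")
  .option("table", "my_dataset.my_table")
  .option("temporaryGcsBucket", "my-temp-bucket")
  .mode("overwrite")
  .save()
```

With the flat columns id, cost, settled, constant, status, BigQuery receives simple top-level field names instead of the nodes.list.element.* paths.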
Thanks for the answer. I got the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: resolved attribute(s) nodes#106 missing from id#178,cost#179,nodes#180,status#181 in operator !Generate explode(nodes#106), true, false, [expld#188];;
Project [id#178, cost#179, nodes#180, status#181, expld#188]
+- !Generate explode(nodes#106), true, false, [expld#188]
   +- LogicalRDD [id#178, cost#179, nodes#180, status#181]

My code: spark.read.json(results.toJSON).withColumn("expld", explode(results.col("nodes"))).select("*","expld.*").drop("expld","nodes").show

Try this instead, referencing the column by name rather than through the original results DataFrame: spark.read.json(results.toJSON).withColumn("expld", explode(col("nodes"))).select("*","expld.*").drop("expld","nodes").show