Schema conversion from String to Array[StructType] using Spark Scala
I have sample data as shown below, and I need to convert the columns ABS and ALT from string to Array[StructType] using Spark Scala code. Any help would be appreciated. With the help of a UDF I was able to convert the strings to an ArrayType, but I need some help converting these two columns from string to Array[StructType].
VIN          TT  MSG_TYPE  ABS                           ALT
MSGXXXXXXXX  1   SIGL      [{"E":1569XXXXXXX,"V":0.0}]   [{"E":156957XXXXXX,"V":0.0}]
df.currentSchema
root
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: string (nullable = true)
|-- ALT: string (nullable = true)
df.expectedSchema:
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
You can parse the JSON and convert it into an array of structs.
First, define a function that parses the JSON (based on the linked answer):
Finally, we call the udf in a select statement:
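The UDF definition from the linked answer did not survive the page extraction. As a rough, hypothetical sketch (the `EV` case class and `parseEV` name are assumptions, and the regex is for illustration only; a real implementation would use a proper JSON library such as json4s, which ships with Spark), the parser might look like:

```scala
// Hypothetical sketch of a parser for strings like [{"E":1569,"V":0.0}].
// The case class field names become the struct field names E and V.
case class EV(E: Long, V: Double)

def parseEV(json: String): Seq[EV] = {
  if (json == null) Seq.empty // simple null check, as discussed in the comments
  else {
    // Match each {"E":<long>,"V":<double>} entry in the array string
    val entry = """\{\s*"E"\s*:\s*(-?\d+)\s*,\s*"V"\s*:\s*(-?[\d.]+(?:[eE][+-]?\d+)?)\s*\}""".r
    entry.findAllMatchIn(json)
      .map(m => EV(m.group(1).toLong, m.group(2).toDouble))
      .toSeq
  }
}

// In Spark, wrap it as a UDF (assumed usage):
// import org.apache.spark.sql.functions.udf
// val toStructUdf = udf(parseEV _)
```

Returning `Seq[EV]` lets Spark infer the column type as `array<struct<E:bigint,V:double>>`, which matches the expected schema.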
Output:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)
It will also work if you try the following:
import org.apache.spark.sql.types.{ArrayType, DoubleType, LongType, StructField, StructType}
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val schema = ArrayType(StructType(Seq(StructField("E", LongType), StructField("V", DoubleType))))
val final_df = newDF
  .withColumn("ABS", from_json($"ABS", schema))
  .withColumn("ALT", from_json($"ALT", schema))
Final printed schema:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
Thank you very much for the help @werner, your solution works well. I need one more piece of help: is there a way to handle null values in the UDF? The solution fails if ABS or ALT is null. I added a simple null check in the toStruct method.
val df = ...
val newdf = df.select('VIN, 'TT, 'MSG_TYPE, toStructUdf('ABS).as("ABS"), toStructUdf('ALT).as("ALT"))
newdf.printSchema()