Schema conversion from String to Array[StructType] using Spark Scala
I have sample data as shown below, and I need to convert the columns ABS and ALT from string to Array[StructType] using Spark Scala code. Any help would be appreciated. With the help of a UDF I was able to convert the strings to an ArrayType, but I need some help converting these two columns from string to Array[StructType].
VIN          TT  MSG_TYPE  ABS                           ALT
MSGXXXXXXXX  1   SIGL      [{"E":1569XXXXXXX,"V":0.0}]   [{"E":156957XXXXXX,"V":0.0}]
df.currentSchema
root
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: string (nullable = true)
|-- ALT: string (nullable = true)
df.expectedSchema:
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = true)
You can parse the JSON and convert it into an array of structs.
First, define a function that parses the JSON (based on the linked answer):
Finally, we call the udf in a select statement:
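The UDF definition from the linked answer did not survive the page extraction. As a rough, hypothetical sketch (the `EV` case class and `parseEV` name are assumptions, and the regex is for illustration only; a real implementation would use a proper JSON library such as json4s, which ships with Spark), the parser might look like:

```scala
// Hypothetical sketch of a parser for strings like [{"E":1569,"V":0.0}].
// The case class field names become the struct field names E and V.
case class EV(E: Long, V: Double)

def parseEV(json: String): Seq[EV] = {
  if (json == null) Seq.empty // simple null check, as discussed in the comments
  else {
    // Match each {"E":<long>,"V":<double>} entry in the array string
    val entry = """\{\s*"E"\s*:\s*(-?\d+)\s*,\s*"V"\s*:\s*(-?[\d.]+(?:[eE][+-]?\d+)?)\s*\}""".r
    entry.findAllMatchIn(json)
      .map(m => EV(m.group(1).toLong, m.group(2).toDouble))
      .toSeq
  }
}

// In Spark, wrap it as a UDF (assumed usage):
// import org.apache.spark.sql.functions.udf
// val toStructUdf = udf(parseEV _)
```

Returning `Seq[EV]` lets Spark infer the column type as `array<struct<E:bigint,V:double>>`, which matches the expected schema.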
Output:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: string (nullable = true)
| | |-- V: double (nullable = false)
It will also work if you try the following:
import org.apache.spark.sql.types.{ArrayType, DoubleType, LongType, StructField, StructType}
import org.apache.spark.sql.functions.from_json
import spark.implicits._

val schema = ArrayType(StructType(Seq(StructField("E", LongType), StructField("V", DoubleType))))
val final_df = newDF
  .withColumn("ABS", from_json($"ABS", schema))
  .withColumn("ALT", from_json($"ALT", schema))
Final printed schema:
root
|-- VIN: string (nullable = true)
|-- TT: string (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
|-- ALT: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- E: long (nullable = true)
| | |-- V: double (nullable = false)
Thank you very much for the help @werner, your solution works well. I need one more piece of help: is there a way to handle null values in the UDF? The solution fails if ABS or ALT is null. I added a simple null check in the toStruct method.
val df = ...
val newdf = df.select('VIN, 'TT, 'MSG_TYPE, toStructUdf('ABS).as("ABS"), toStructUdf('ALT).as("ALT"))
newdf.printSchema()