
Schema conversion from String to Array[StructType] using Spark Scala

Tags: scala, apache-spark, apache-spark-sql

I have sample data as shown below, and I need to convert the columns ABS and ALT from string to Array[StructType] using Spark Scala code. Any help would be greatly appreciated.

With the help of a UDF I was able to convert the strings to ArrayType, but I need some help converting these two columns from string to Array[StructType].

VIN         TT  MSG_TYPE  ABS                          ALT
MSGXXXXXXXX 1   SIGL      [{"E":1569XXXXXXX,"V":0.0}]  [{"E":156957XXXXXX,"V":0.0}]

df.currentSchema 
root
|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: string (nullable = true)
|-- ALT: string (nullable = true)
df.expectedSchema:

|-- VIN: string (nullable = true)
|-- TT: long (nullable = true)
|-- MSG_TYPE: string (nullable = true)
|-- ABS: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- E: long (nullable = true)
|    |    |-- V: long (nullable = true)
|-- ALT: array (nullable = true)
|    |-- element: struct (containsNull = true)
|    |    |-- E: long (nullable = true)
|    |    |-- V: double (nullable = true)
You can parse the JSON and convert it into an array of structs.

First, define a function that parses the JSON, based on this answer:
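The linked answer's code did not survive this copy. A minimal sketch of such a parsing function, assuming json4s (which is bundled with Spark) and hypothetical names `Element` / `toStruct`:

```scala
import org.apache.spark.sql.functions.udf
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical element type for entries like {"E":1569...,"V":0.0};
// E is kept as String to match the schema printed in the output below.
case class Element(E: String, V: Double)

// Parse a JSON array string such as [{"E":...,"V":0.0}] into a Seq of Elements.
def toStruct(s: String): Seq[Element] = {
  implicit val formats: Formats = DefaultFormats
  parse(s).extract[Seq[Element]]
}

// Wrap it as a Spark UDF; the resulting column type is array<struct<E:string,V:double>>.
val toStructUdf = udf(toStruct _)
```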

Finally, call the UDF in a select statement:
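A sketch of that select, assuming the UDF is bound to a value named `toStructUdf` (as in the snippet quoted in the comments below):

```scala
// Requires `import spark.implicits._` for the 'column symbol syntax.
val newdf = df.select('VIN, 'TT, 'MSG_TYPE,
  toStructUdf('ABS).as("ABS"), toStructUdf('ALT).as("ALT"))
newdf.printSchema()
```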

Output:

root
 |-- VIN: string (nullable = true)
 |-- TT: string (nullable = true)
 |-- MSG_TYPE: string (nullable = true)
 |-- ABS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- E: string (nullable = true)
 |    |    |-- V: double (nullable = false)
 |-- ALT: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- E: string (nullable = true)
 |    |    |-- V: double (nullable = false)
It will also work if you try the following:

import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructField, StructType, ArrayType, LongType, DoubleType}

val schema = ArrayType(StructType(Seq(StructField("E", LongType), StructField("V", DoubleType))))

val final_df = newDF.withColumn("ABS", from_json($"ABS", schema)).withColumn("ALT", from_json($"ALT", schema))
The final printed schema:

  root
 |-- VIN: string (nullable = true)
 |-- TT: string (nullable = true)
 |-- MSG_TYPE: string (nullable = true)
 |-- ABS: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- E: long (nullable = true)
 |    |    |-- V: double (nullable = false)
 |-- ALT: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- E: long (nullable = true)
 |    |    |-- V: double (nullable = false)

Thank you very much for your help @werner, your solution works well. One more question: is there a way to handle null values in the UDF? The solution fails if ABS or ALT is null. I added a simple null check to the toStruct method.
val df = ...
val newdf = df.select('VIN, 'TT, 'MSG_TYPE, toStructUdf('ABS).as("ABS"), toStructUdf('ALT).as("ALT"))
newdf.printSchema()
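The "simple null check" mentioned in the comment above could look like this sketch (assuming the parsing function is named `toStruct` and a hypothetical `Element` case class for the `{"E":...,"V":...}` entries):

```scala
// Return an empty array instead of failing when the column value is null or blank.
def toStruct(s: String): Seq[Element] =
  if (s == null || s.trim.isEmpty) Seq.empty[Element]
  else {
    implicit val formats: Formats = DefaultFormats
    parse(s).extract[Seq[Element]]
  }
```

Note that the `from_json` approach shown in the other answer handles this case for free: `from_json` simply returns null for null or unparsable input.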