Parsing a string column to get date-formatted data using Spark Scala


In my .avro file I have a column (TriggeredDateTime) of string type, and I need to get the data in yyyy-MM-dd HH:mm:ss format (as shown in the expected output) using Spark Scala. Please let me know if there is any way to achieve this by writing a UDF, rather than using my approach below. Any help would be greatly appreciated.

 "TriggeredDateTime": {"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}

  expected output
  +-------------------+
  |TriggeredDateTime  |
  +-------------------+
  |2019-05-16 04:56:19|
  +-------------------+
My approach:

I am trying to convert the .avro file to JSON format by applying a schema, after which I can try to parse the JSON to get the desired result. A sketch of the loading step is shown below.
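
(For context, a minimal sketch of how the .avro file might be loaded into the initial DataFrame; the file path and the availability of the spark-avro package are assumptions, not part of the original question.)

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("AvroDateParse").getOrCreate()
import spark.implicits._

// Spark 2.4+ ships Avro support as the external "avro" format;
// earlier versions need the "com.databricks.spark.avro" package instead.
val initialDF = spark.read.format("avro").load("/path/to/file.avro")
initialDF.printSchema()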

DataFrame sample data:

[{"vin":"FU7123456XXXXX","basetime":0,"dtctime":189834,"latitude":36.341587,"longitude":140.327676,"dtcs":[{"fmi":1,"spn":2631,"dtc":"470A01","id":1},{"fmi":0,"spn":0,"dtc":"000000","id":61}],"signals":[{"timestamp":78799,"spn":174,"value":45,"name":"PT"},{"timestamp":12345,"spn":0,"value":10.2,"name":"PT"},{"timestamp":194915,"spn":0,"value":0,"name":"PT"}],"sourceEcu":"MCM","TriggeredDateTime":{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}}]
DataFrame printSchema:

initialDF.printSchema
root
 |-- vin: string (nullable = true)
 |-- basetime: string (nullable = true)
 |-- dtctime: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- dtcs: string (nullable = true)
 |-- signals: string (nullable = true)
 |-- sourceEcu: string (nullable = true)
 |-- dtcTriggeredDateTime: string (nullable = true)

Instead of writing a UDF, you can use the built-in get_json_object to parse the JSON rows and format_string to produce the desired output:

import org.apache.spark.sql.functions.{get_json_object, format_string}

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

df.select(
  format_string("%s-%s-%s %s:%s:%s",
    get_json_object($"TriggeredDateTime", "$.dateTime.date.year").as("year"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.month").as("month"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.day").as("day"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.hour").as("hour"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.minute").as("min"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.second").as("sec")
  ).as("TriggeredDateTime")
).show(false)
Output:

+-----------------+
|TriggeredDateTime|
+-----------------+
|2019-5-16 4:56:19|
|2018-5-16 4:56:19|
+-----------------+
The function get_json_object converts the JSON string into a JSON object; each part of the date is then extracted with the corresponding selector, e.g. $.dateTime.date.year, and passed as an argument to the format_string function to build the final date.
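
Note that format_string with %s does not zero-pad, which is why the output above reads 2019-5-16 rather than the expected 2019-05-16. If the exact yyyy-MM-dd HH:mm:ss layout matters, one possible variant (a sketch, not part of the original answer) is to cast each extracted part to an integer and use numeric width specifiers:

import org.apache.spark.sql.functions.{get_json_object, format_string}

// Cast each part to int so the zero-padding specifiers (%02d) apply,
// yielding e.g. 2019-05-16 04:56:19.
df.select(
  format_string("%04d-%02d-%02d %02d:%02d:%02d",
    get_json_object($"TriggeredDateTime", "$.dateTime.date.year").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.month").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.day").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.hour").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.minute").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.second").cast("int")
  ).as("TriggeredDateTime")
).show(false)

The casts are needed because get_json_object returns strings, while %d expects an integer argument.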

UPDATE:

For completeness, instead of calling get_json_object multiple times, we can use from_json, providing the schema, which we already know:
import org.apache.spark.sql.functions.{from_json, format_string}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

val schema =
  StructType(Seq(
    StructField("dateTime", StructType(Seq(
      StructField("date",
        StructType(Seq(
          StructField("year", IntegerType, false),
          StructField("month", IntegerType, false),
          StructField("day", IntegerType, false)
        ))
      ),
      StructField("time",
        StructType(Seq(
          StructField("hour", IntegerType, false),
          StructField("minute", IntegerType, false),
          StructField("second", IntegerType, false),
          StructField("nano", IntegerType, false)
        ))
      )
    ))),
    StructField("offset", StructType(Seq(
      StructField("totalSeconds", IntegerType, false)
    )))
  ))

df.select(
  from_json($"TriggeredDateTime", schema).as("parsedDateTime")
)
.select(
  format_string("%s-%s-%s %s:%s:%s",
    $"parsedDateTime.dateTime.date.year".as("year"),
    $"parsedDateTime.dateTime.date.month".as("month"),
    $"parsedDateTime.dateTime.date.day".as("day"),
    $"parsedDateTime.dateTime.time.hour".as("hour"),
    $"parsedDateTime.dateTime.time.minute".as("min"),
    $"parsedDateTime.dateTime.time.second".as("sec")
  ).as("TriggeredDateTime")
)
.show(false)
// +-----------------+
//| TriggeredDateTime|
// +-----------------+
// |2019-5-16 4:56:19|
// |2018-5-16 4:56:19|
// +-----------------+
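
As an aside (a sketch under stated assumptions, not part of the original answer), the same schema can be derived from case classes via Encoders, which avoids hand-writing the nested StructType. The case class names below are hypothetical; the field names, however, must match the JSON keys exactly for from_json to bind them:

import org.apache.spark.sql.Encoders

// Hypothetical case classes mirroring the JSON structure.
case class DatePart(year: Int, month: Int, day: Int)
case class TimePart(hour: Int, minute: Int, second: Int, nano: Int)
case class DateTimePart(date: DatePart, time: TimePart)
case class OffsetPart(totalSeconds: Int)
case class Triggered(dateTime: DateTimePart, offset: OffsetPart)

// Produces a StructType equivalent to the hand-written schema above
// (fields come out nullable = false because the case class fields are primitives).
val derivedSchema = Encoders.product[Triggered].schema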

Please update the question with whatever code/UDF/JSON parser you have written. Could you also provide the input DataFrame after reading the data from Avro? — @Nikk, I have added the sample data and printSchema of the DataFrame.