Apache Spark: converting any date format to DD-MM-YYYY hh:MM:ss in Spark

I have a file with a date column. It contains dates in several different formats, and I have to convert all of them to

DD-MM-YYYY hh:MM:ss

I wrote the following query, but it does not give the expected result:

scala> val a = Seq(("01-Jul-2019"),("01-Jul-2019 00:01:05"),("Jul-01-2019"),("2019-07-01")).toDF("create_dts").select(col("create_dts"))
a: org.apache.spark.sql.DataFrame = [create_dts: string]

scala>

scala> val r = a.withColumn("create_dts", date_format(to_timestamp($"create_dts", "dd-MMM-yyyy").cast("timestamp"), "dd-MM-yyyy hh:mm:ss")).show


+-------------------+
|         create_dts|
+-------------------+
|01-07-2019 12:00:00|
|01-07-2019 12:00:00|
|               null|
|               null|
+-------------------+

The last two rows come back null because they do not match the dd-MMM-yyyy pattern. When I wrap each candidate format in a when condition, it works fine:

val a = Seq(("01-Jul-2019"), ("01-07-2019")).toDF("create_dts")

// Each when branch fires only if its format parses successfully (non-null result)
val r = a.withColumn("create_dts",
  when(to_timestamp($"create_dts", "dd-MMM-yyyy").cast("date").isNotNull,
       date_format(to_timestamp($"create_dts", "dd-MMM-yyyy").cast("date"), "dd-MM-yyyy"))
  .when(to_timestamp($"create_dts", "dd-MM-yyyy").cast("date").isNotNull,
        date_format(to_timestamp($"create_dts", "dd-MM-yyyy").cast("date"), "dd-MM-yyyy")))
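One caveat, which is my own note rather than part of the post: a when chain with no otherwise clause still returns null for values that match neither format. A minimal sketch (rKeep is a hypothetical name) that falls back to the raw string instead:

// Hypothetical variant: .otherwise keeps the original string when no format matches
val rKeep = a.withColumn("create_dts",
  when(to_timestamp($"create_dts", "dd-MMM-yyyy").isNotNull,
       date_format(to_timestamp($"create_dts", "dd-MMM-yyyy"), "dd-MM-yyyy"))
  .when(to_timestamp($"create_dts", "dd-MM-yyyy").isNotNull,
        date_format(to_timestamp($"create_dts", "dd-MM-yyyy"), "dd-MM-yyyy"))
  .otherwise($"create_dts"))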

You can use the coalesce function to get the first non-null conversion:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

// Try each format in order; coalesce returns the first parse that succeeds
def to_timestamp_multiple(s: Column, formats: Seq[String]): Column = {
  coalesce(formats.map(fmt => to_timestamp(s, fmt)): _*)
}

a.withColumn("converted", date_format(to_timestamp_multiple($"create_dts",
      Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd"))
    .cast("timestamp"), "dd-MM-yyyy hh:mm:ss")).show
The result is:

+--------------------+-------------------+
|          create_dts|          converted|
+--------------------+-------------------+
|         01-Jul-2019|01-07-2019 12:00:00|
|01-Jul-2019 00:01:05|01-07-2019 12:00:00|
|         Jul-01-2019|01-07-2019 12:00:00|
|          2019-07-01|01-07-2019 12:00:00|
+--------------------+-------------------+
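Two caveats, both my own observations rather than part of the original answer: the time of day in "01-Jul-2019 00:01:05" is dropped, because "dd-MMM-yyyy" only consumes the date prefix; and parsing a string with trailing text this way relies on Spark's legacy parser (on Spark 3 you may need to set spark.sql.legacy.timeParserPolicy to LEGACY). A sketch that puts the more specific pattern first, so coalesce prefers a parse that keeps the time:

// Assumption: "dd-MMM-yyyy HH:mm:ss" listed first lets coalesce keep the
// time component for inputs that actually carry one
a.withColumn("converted", date_format(to_timestamp_multiple($"create_dts",
      Seq("dd-MMM-yyyy HH:mm:ss", "dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd")),
    "dd-MM-yyyy hh:mm:ss")).show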
