Apache Spark: converting any date format to DD-MM-YYYY hh:MM:ss in Spark
Tags: apache-spark, hive, apache-spark-sql

I have a file with a date column that contains dates in multiple formats. I have to convert all of them to
DD-MM-YYYY hh:MM:ss
I wrote the following query, but did not get the expected result:
scala> val a = Seq(("01-Jul-2019"),("01-Jul-2019 00:01:05"),("Jul-01-2019"),("2019-07-01")).toDF("create_dts").select(col("create_dts"))
a: org.apache.spark.sql.DataFrame = [create_dts: string]
scala>
scala> val r = a.withColumn("create_dts", date_format(to_timestamp($"create_dts", "dd-MMM-yyyy").cast("timestamp"), "dd-MM-yyyy hh:mm:ss")).show
+-------------------+
| create_dts|
+-------------------+
|01-07-2019 12:00:00|
|01-07-2019 12:00:00|
| null|
| null|
+-------------------+
When used with `when` conditions, it now works:
val a = Seq(("01-Jul-2019"),("01-07-2019")).toDF("create_dts")
val r = a.withColumn("create_dts",
  when(to_timestamp($"create_dts", "dd-MMM-yyyy").cast("date").isNotNull,
    date_format(to_timestamp($"create_dts", "dd-MMM-yyyy").cast("date"), "dd-MM-yyyy"))
  .when(to_timestamp($"create_dts", "dd-MM-yyyy").cast("date").isNotNull,
    date_format(to_timestamp($"create_dts", "dd-MM-yyyy").cast("date"), "dd-MM-yyyy")))
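One caveat with this `when` chain (an observation, not from the original post): it has no `.otherwise` clause, so any value that matches none of the patterns silently becomes null. A minimal sketch that falls back to the raw string instead, assuming the same `a` DataFrame as above:

```scala
import org.apache.spark.sql.functions._

// Same two-pattern chain as above, but unmatched values keep the
// original string rather than silently turning into null.
val r2 = a.withColumn("create_dts",
  when(to_timestamp($"create_dts", "dd-MMM-yyyy").isNotNull,
    date_format(to_timestamp($"create_dts", "dd-MMM-yyyy"), "dd-MM-yyyy"))
  .when(to_timestamp($"create_dts", "dd-MM-yyyy").isNotNull,
    date_format(to_timestamp($"create_dts", "dd-MM-yyyy"), "dd-MM-yyyy"))
  .otherwise($"create_dts"))
```

Whether falling back to the raw string is better than null depends on the downstream consumer; nulls are easier to filter, raw strings are easier to debug.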
You can use the `coalesce` function to take the first non-null conversion:
import org.apache.spark.sql.Column
def to_timestamp_multiple(s: Column, formats: Seq[String]): Column = {
coalesce(formats.map(fmt => to_timestamp(s, fmt)):_*)
}
a.withColumn("converted", date_format(to_timestamp_multiple($"create_dts",
Seq("dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd"))
.cast("timestamp"), "dd-MM-yyyy hh:mm:ss")).show
The result is:
+--------------------+-------------------+
| create_dts| converted|
+--------------------+-------------------+
| 01-Jul-2019|01-07-2019 12:00:00|
|01-Jul-2019 00:01:05|01-07-2019 12:00:00|
| Jul-01-2019|01-07-2019 12:00:00|
| 2019-07-01|01-07-2019 12:00:00|
+--------------------+-------------------+
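Note the `12:00:00` in the converted column: in the datetime patterns Spark uses, `hh` is the 12-hour clock hour (1-12), so midnight prints as 12, while `HH` is the 24-hour hour-of-day. A sketch (my variation, reusing the `to_timestamp_multiple` helper above) that keeps a 24-hour clock and lets the time in inputs like `01-Jul-2019 00:01:05` survive by trying a pattern with a time component first:

```scala
import org.apache.spark.sql.functions._

// "HH" = hour-of-day (00-23); "hh" = clock hour of am/pm (01-12),
// which is why midnight rendered as 12:00:00 in the output above.
// Listing "dd-MMM-yyyy HH:mm:ss" first lets rows that carry a
// time keep it instead of collapsing to midnight.
val formatted = a.withColumn("converted",
  date_format(
    to_timestamp_multiple($"create_dts",
      Seq("dd-MMM-yyyy HH:mm:ss", "dd-MMM-yyyy", "MMM-dd-yyyy", "yyyy-MM-dd")),
    "dd-MM-yyyy HH:mm:ss"))
```

Order matters inside `coalesce`: put the most specific pattern first, otherwise a looser pattern may claim the string before the one that preserves the most information gets a chance.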