Scala: convert string to date in Spark SQL
I am new to Scala. I have a dataframe and I am trying to convert one of its columns from a string to a date, in the two forms shown below:
1) yyyyMMddHHmmss (20150610120256) -> yyyy-MM-dd HH:mm:ss (2015-06-10 12:02:56)
2) yyyyMMdd (20150611 ) -> yyyy-MM-dd (2015-06-11)
I can handle the first case successfully, but the second one is the problem: because of it I cannot convert the string to a date. More details are below; any help would be appreciated.
df.printSchema
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: long (nullable = true)
|-- IN_DATE: string (nullable = true)
df.show
Input
+-----+-------+---------+---------+-------------------+-----------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+-----------------+
| F | 000544| 2017002| OP | 95032015062763298| 20150610120256 |
| F | 000544| 2017002| LD | 95032015062763261| 20150611 |
| F | 000544| 2017002| AK | 95037854336743246| 20150611012356 |
+-----+-------+---------+---------+-------------------+-----------------+
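For reference, a minimal sketch that rebuilds this sample dataframe (assuming a SparkSession in scope as spark; the values, including the trailing padding in the LD row's IN_DATE, are copied from the table above):

import spark.implicits._

// Rebuild the sample rows; the trailing spaces in the second IN_DATE value
// reproduce the padding that the length/trim logic below has to deal with.
val sample = Seq(
  ("F", "000544", "2017002", "OP", 95032015062763298L, "20150610120256"),
  ("F", "000544", "2017002", "LD", 95032015062763261L, "20150611      "),
  ("F", "000544", "2017002", "AK", 95037854336743246L, "20150611012356")
).toDF("TYPE", "CODE", "SQ_CODE", "RE_TYPE", "VERY_ID", "IN_DATE")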
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("date")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| null |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
df=df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast("timestamp")))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmss").cast("timestamp")))
Actual output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 00:00:00 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
Expected output
+-----+-------+---------+---------+-------------------+----------------------+
| TYPE| CODE| SQ_CODE| RE_TYPE | VERY_ID| IN_DATE |
+-----+-------+---------+---------+-------------------+----------------------+
| F | 000544| 2017002| OP | 95032015062763298| 2015-06-10 12:02:56 |
| F | 000544| 2017002| LD | 95032015062763261| 2015-06-11 |
| F | 000544| 2017002| AK | 95037854336743246| 2015-06-11 01:23:56 |
+-----+-------+---------+---------+-------------------+----------------------+
Please try this query:
df.withColumn("IN_DATE",when(lit(length(regexp_replace(df("IN_DATE"),"\\s+",""))) === lit(8) ,
to_date(from_unixtime(regexp_replace(df("IN_DATE"),"\\s+",""),"yyyyMMdd").cast(DateType)))
.otherwise(unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType)))
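One caveat with this query: in the SimpleDateFormat patterns that unix_timestamp uses, lowercase ss means seconds while uppercase S means fractional seconds, so yyyyMMddHHmmSS silently drops the seconds (visible as 12:02:00.0 in the output shown further down). A corrected sketch of the same branching idea, untested:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.TimestampType

// Branch on the trimmed length as before, but parse the seconds with "ss".
// Both branches of when/otherwise must share one type, so the 8-character
// dates still come out as midnight timestamps rather than bare dates.
df.withColumn("IN_DATE",
  when(length(trim(df("IN_DATE"))) === 8,
    unix_timestamp(trim(df("IN_DATE")), "yyyyMMdd").cast(TimestampType))
    .otherwise(unix_timestamp(df("IN_DATE"), "yyyyMMddHHmmss").cast(TimestampType)))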
There are several options for implementing the date parser itself, for example to_date(). The important point, though, is that 2015-06-11 has the type spark.sql.types.DateType, while 2015-06-10 12:02:56 is a spark.sql.types.TimestampType.
You cannot have two data types in the same column; each column of a schema must have exactly one data type.
I would suggest creating two new columns, each holding the format you want:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{DateType, TimestampType}
df.withColumn("IN_DATE_DateOnly",from_unixtime(unix_timestamp(df("IN_DATE"),"yyyyMMdd")).cast(DateType))
.withColumn("IN_DATE_DateAndTime",unix_timestamp(df("IN_DATE"),"yyyyMMddHHmmSS").cast(TimestampType))
This gives you the dataframe as:
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|TYPE|CODE |SQ_CODE|RE_TYPE|VERY_ID |IN_DATE |IN_DATE_DateOnly|IN_DATE_DateAndTime |
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
|F |000544|2017002|OP |95032015062763298|20150610120256|null |2015-06-10 12:02:00.0|
|F |000544|2017002|LD |95032015062763261|20150611 |2015-06-11 |null |
|F |000544|2017002|AK |95037854336743246|20150611012356|null |2015-06-11 01:23:00.0|
+----+------+-------+-------+-----------------+--------------+----------------+---------------------+
You can see that the data types are different:
root
|-- TYPE: string (nullable = true)
|-- CODE: string (nullable = true)
|-- SQ_CODE: string (nullable = true)
|-- RE_TYPE: string (nullable = true)
|-- VERY_ID: string (nullable = true)
|-- IN_DATE: string (nullable = true)
|-- IN_DATE_DateOnly: date (nullable = true)
|-- IN_DATE_DateAndTime: timestamp (nullable = true)
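If a single column really has to show both shapes, as in the expected output, the usual workaround is to keep it as a formatted string, since only a string column can hold both renderings. A minimal sketch of that idea (untested; the column name IN_DATE_STR is made up for illustration):

import org.apache.spark.sql.functions._

// Try the full timestamp pattern first, then fall back to the date-only
// pattern; date_format renders each successful parse in its own shape.
df.withColumn("IN_DATE_STR", coalesce(
  date_format(unix_timestamp(df("IN_DATE"), "yyyyMMddHHmmss").cast("timestamp"), "yyyy-MM-dd HH:mm:ss"),
  date_format(unix_timestamp(trim(df("IN_DATE")), "yyyyMMdd").cast("timestamp"), "yyyy-MM-dd")))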
I hope the answer helps.

Comments: "This is not possible because the data types of the two dates are different. One is a TimestampType and the other is a DateType; the same column cannot have two schemas." "We do have trim in Spark SQL."

I'd:
- choose the more precise data type, here TimestampType
- coalesce the same column parsed with the different formats:
import org.apache.spark.sql.functions._
import spark.implicits._  // for toDF and the $-notation; assumes a SparkSession named spark
val df = Seq("20150610120256", "20150611").toDF("IN_DATE")
df.withColumn("IN_DATE", coalesce(
to_timestamp($"IN_DATE", "yyyyMMddHHmmss"),
to_timestamp($"IN_DATE", "yyyyMMdd"))).show
+-------------------+
| IN_DATE|
+-------------------+
|2015-06-10 12:02:56|
|2015-06-11 00:00:00|
+-------------------+
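Note that to_timestamp is only available from Spark 2.2 onwards; on older versions the same coalesce idea can be written with unix_timestamp. A sketch, again assuming spark.implicits._ is in scope:

import org.apache.spark.sql.functions._

// Pre-2.2 equivalent: try the longer pattern first, then the date-only one.
df.withColumn("IN_DATE", coalesce(
  unix_timestamp($"IN_DATE", "yyyyMMddHHmmss").cast("timestamp"),
  unix_timestamp(trim($"IN_DATE"), "yyyyMMdd").cast("timestamp"))).show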