Apache Spark: convert a datetime timestamp in a Spark DataFrame to an epoch timestamp


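I have a Parquet file with a timestamp column, written from pandas, in this format:

2020-07-07 18:30:14.500000+00:00

When I read the same Parquet file in Spark, the value is read as

2020-07-08 00:00:14.5

I want to convert it to an epoch timestamp in milliseconds, which is

1594146614500

I tried using Java date-time formatting: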

val dtformat = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
dtformat.parse(r2.getAs[Long]("date_time").toString).getTime

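It converts, but to the wrong value (1594146614005) instead of 1594146614500.

To get the correct value I have to append "00" to the string before parsing:

dtformat.parse(r2.getAs[Long]("date_time").toString + "00").getTime

Is there a cleaner approach than this?

Is there a function available in Spark to read the value as milliseconds?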

Update 1:

1594146614500

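After using the answer below:

df.withColumn("timestamp", to_timestamp($"date_time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint"))
  .show()

The drawback is that, if the data is at 500 ms granularity, each timestamp gets two identical epoch timestamps, which is not expected.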

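I suggest you switch from the outdated, error-prone date/time API in java.util and its formatting API (java.text.SimpleDateFormat) to the modern date/time API in java.time and its formatting API (java.time.format):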

import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class Main {
    public static void main(String[] args) {
        OffsetDateTime odt = OffsetDateTime.parse("2020-07-07 18:30:14.500000+00:00",
                DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSSSSSZZZZZ"));
        System.out.println(odt.toInstant().toEpochMilli());
    }
}

Output:

1594146614500


Using Spark DataFrame functions:

df.withColumn("timestamp", to_timestamp($"time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint"))
  .show(false)

+--------------------------------+---------------------+-------------+
|time                            |timestamp            |epoch        |
+--------------------------------+---------------------+-------------+
|2020-07-07 18:30:14.500000+00:00|2020-07-07 18:30:14.5|1594146614500|
+--------------------------------+---------------------+-------------+

This is also a possible approach.
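The cast to decimal in the snippet above matters because, as far as I know, Spark SQL turns a timestamp cast to a decimal into seconds since the epoch with the fractional part kept, whereas a cast straight to a long gives whole seconds only. The following plain-Java sketch (not Spark code; the class and method names are made up for illustration) reproduces just that arithmetic with BigDecimal:

```java
import java.math.BigDecimal;
import java.time.Instant;

public class EpochMillisArithmetic {
    // Mirrors "(seconds as decimal * 1000) cast to bigint": the fraction survives.
    static long viaDecimalSeconds(BigDecimal epochSeconds) {
        return epochSeconds.multiply(BigDecimal.valueOf(1000)).longValue();
    }

    // Truncates to whole seconds first, so the milliseconds are lost.
    static long viaWholeSeconds(BigDecimal epochSeconds) {
        return epochSeconds.longValue() * 1000;
    }

    public static void main(String[] args) {
        Instant instant = Instant.parse("2020-07-07T18:30:14.500Z");
        // Epoch seconds with three decimals: 1594146614.500
        BigDecimal seconds = BigDecimal.valueOf(instant.toEpochMilli(), 3);
        System.out.println(viaDecimalSeconds(seconds)); // 1594146614500
        System.out.println(viaWholeSeconds(seconds));   // 1594146614000
    }
}
```

With whole-second truncation, two rows that are 500 ms apart within the same second would collapse to the same epoch value, so keeping the decimal step is the safer choice.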


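I recommend that you don't use SimpleDateFormat. That class is notoriously troublesome and long outdated. Instead use LocalDateTime and DateTimeFormatter, both from java.time, the modern Java date and time API. Also, SimpleDateFormat cannot parse 2020-07-08 00:00:14.5 correctly: it supports only milliseconds, that is, exactly three decimals on the seconds.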
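As a minimal sketch of that advice, the example below parses the string as Spark displayed it with LocalDateTime and a DateTimeFormatter that accepts a variable number of fractional digits, and contrasts it with SimpleDateFormat. The class and method names are hypothetical. The +05:30 offset is an assumption inferred from the two values quoted in the question (18:30:14.5 UTC shown as 00:00:14.5 the next day); substitute the zone your Spark session actually uses.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.temporal.ChronoField;
import java.util.TimeZone;

public class FractionParsing {
    // Assumption: offset of the zone Spark used for display, inferred from the question.
    static final ZoneOffset DISPLAY_OFFSET = ZoneOffset.ofHoursMinutes(5, 30);

    // Accepts 0 to 9 fractional digits after the seconds, so ".5" means half a second.
    static final DateTimeFormatter FORMATTER = new DateTimeFormatterBuilder()
            .appendPattern("uuuu-MM-dd HH:mm:ss")
            .appendFraction(ChronoField.NANO_OF_SECOND, 0, 9, true)
            .toFormatter();

    static long withJavaTime(String text) {
        return LocalDateTime.parse(text, FORMATTER)
                .toInstant(DISPLAY_OFFSET)
                .toEpochMilli();
    }

    // SimpleDateFormat reads the digits after the dot as a millisecond count,
    // so ".5" becomes 5 ms rather than 500 ms.
    static long withSimpleDateFormat(String text) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS");
        format.setTimeZone(TimeZone.getTimeZone("GMT+05:30"));
        try {
            return format.parse(text).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable timestamp: " + text, e);
        }
    }

    public static void main(String[] args) {
        String text = "2020-07-08 00:00:14.5";
        System.out.println(withJavaTime(text));
        System.out.println(withSimpleDateFormat(text));
    }
}
```

Under these assumptions the first line printed should be 1594146614500 and the second 1594146614005, which matches the wrong value reported in the question.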

Thanks @Lamanus. There is one drawback to using this: if the data is at 500 ms granularity, then each timestamp has two identical values, as I have added in the update to the question.

I don't understand. Your original time is not displayed correctly, and even the format looks different.