Apache Spark: prevent Spark from shifting timestamps in spark-shell


I started spark-shell with:

spark-shell --conf spark.sql.session.timeZone=utc

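When I run the example below, the utc_shifted column contains a timestamp that has been shifted. It does not hold the output I expect from the UDF but something else. Specifically: the input is already UTC, and Spark shifts it again. How can I fix this behaviour?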

+-----------------+-----------------------+-------------------+
|value            |utc_shifted            |fitting            |
+-----------------+-----------------------+-------------------+
|20191009145901202|2019-10-09 12:59:01.202|2019-10-09 14:59:01|
|20191009145514816|2019-10-09 12:55:14.816|2019-10-09 14:55:14|
+-----------------+-----------------------+-------------------+

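It looks like not passing the default time zone parameter fixes the issue, but then I cannot be sure I would still get correct results if one of the executors had a different or wrong time zone. So I would prefer to set it. Why does the setting have no effect on Spark's own timestamp parsing? How can I get the same behaviour for my UDF?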
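A plausible reading of the table above, sketched outside Spark: SimpleDateFormat interprets the text in the JVM default time zone, while the value is later displayed in the session time zone (UTC). The zone Europe/Berlin below is an assumption, chosen only because it is UTC+2 on that date and so reproduces the two-hour shift; the question does not say which zone the JVM actually used. The helper names parseIn and renderUtc are hypothetical.

```scala
import java.text.SimpleDateFormat
import java.util.{Date, TimeZone}

val pattern = "yyyyMMddHHmmssSSS"
val raw = "20191009145901202"

// Parse the same text under an explicit zone instead of the JVM default.
def parseIn(zoneId: String): Long = {
  val fmt = new SimpleDateFormat(pattern)
  fmt.setTimeZone(TimeZone.getTimeZone(zoneId))
  fmt.parse(raw).getTime
}

// Render epoch millis as a UTC wall-clock string.
def renderUtc(millis: Long): String = {
  val out = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
  out.setTimeZone(TimeZone.getTimeZone("UTC"))
  out.format(new Date(millis))
}

val asBerlin = parseIn("Europe/Berlin") // ASSUMPTION: stands in for the unknown JVM default
val asUtc = parseIn("UTC")

println(renderUtc(asBerlin)) // 2019-10-09 12:59:01.202
println(renderUtc(asUtc))    // 2019-10-09 14:59:01.202
println((asUtc - asBerlin) / 3600000L) // 2
```

The first printed line matches the utc_shifted column and the second matches the wall-clock time in the fitting column, which is consistent with the UDF reading the JVM default zone while to_timestamp follows spark.sql.session.timeZone.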

Reproducible example:

val input = Seq("20191009145901202", "20191009145514816").toDF

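import scala.util.{Failure, Success, Try}
import java.sql.Timestamp
import java.text.SimpleDateFormat
import org.apache.spark.sql.DataFrame
// Already in scope inside spark-shell; needed when running elsewhere.
import org.apache.spark.sql.functions.{col, to_timestamp, udf}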

def parseTimestampWithMillis(
      timestampColumnInput: String,
      timestampColumnOutput: String,
      formatString: String)(df: DataFrame): DataFrame = {
    def getTimestamp(s: String): Option[Timestamp] = {
      if (s.isEmpty) {
        None
      } else {
        val format = new SimpleDateFormat(formatString)
        Try(new Timestamp(format.parse(s).getTime)) match {
          case Success(t) => {
            println(s"input: ${s}, output: ${t}")
            Some(t)
          }
          case Failure(_) => None
        }
      }
    }

    val getTimestampUDF = udf(getTimestamp _)
    df.withColumn(
      timestampColumnOutput, getTimestampUDF(col(timestampColumnInput)))
  }


input.transform(parseTimestampWithMillis("value", "utc_shifted", "yyyyMMddHHmmssSSS")).withColumn("fitting", to_timestamp(col("value"), "yyyyMMddHHmmssSSS")).show(false)

+-----------------+-----------------------+-------------------+
|value            |utc_shifted            |fitting            |
+-----------------+-----------------------+-------------------+
|20191009145901202|2019-10-09 12:59:01.202|2019-10-09 14:59:01|
|20191009145514816|2019-10-09 12:55:14.816|2019-10-09 14:55:14|
+-----------------+-----------------------+-------------------+

In fact, this setting affects not only what is displayed but also the output written to files.
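One way to make the UDF independent of whatever time zone each JVM happens to have is to pin the formatter's zone explicitly. The following is a minimal sketch, assuming the input strings are meant as UTC wall-clock times; parseUtcTimestamp is a hypothetical name, not part of the original code.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.TimeZone
import scala.util.Try

// Parse `s` as a UTC wall-clock time, ignoring the JVM default time zone.
// Returns None for null, empty, or unparseable input.
def parseUtcTimestamp(formatString: String)(s: String): Option[Timestamp] = {
  if (s == null || s.isEmpty) {
    None
  } else {
    // SimpleDateFormat is not thread-safe, so a fresh instance is built per call.
    val format = new SimpleDateFormat(formatString)
    format.setTimeZone(TimeZone.getTimeZone("UTC"))
    Try(new Timestamp(format.parse(s).getTime)).toOption
  }
}

val parsed = parseUtcTimestamp("yyyyMMddHHmmssSSS")("20191009145901202")
println(parsed.map(_.getTime)) // Some(1570633141202)
```

Inside parseTimestampWithMillis this could replace getTimestamp, for example as udf(parseUtcTimestamp(formatString) _). With the session time zone also set to UTC, the column should then display 14:59:01.202 rather than 12:59:01.202.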

Edit: basically, I set the time zone explicitly as suggested, but I get results that I consider wrong.

spark-shell --conf spark.sql.session.timeZone=UTC --conf "spark.driver.extraJavaOptions=-Duser.timezone=UTC" --conf "spark.executor.extraJavaOptions=-Duser.timezone=UTC"

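Launching the shell this way seems to give me the result I want.

But it does not explain why this only matters for my UDF and not for Spark's internal to_timestamp function.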
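A hedged sketch of why the extra -Duser.timezone=UTC flags would change the UDF but not to_timestamp: that property sets the JVM default time zone, which a SimpleDateFormat created without an explicit zone picks up, whereas to_timestamp is governed by spark.sql.session.timeZone. The snippet simulates the flag with TimeZone.setDefault in a single JVM; withDefaultZone and parseWithDefaultZone are hypothetical helpers, and Europe/Berlin is again only an assumed stand-in for the original default.

```scala
import java.text.SimpleDateFormat
import java.util.TimeZone

// Temporarily switch the JVM default zone, run `body`, then restore it.
def withDefaultZone[A](zoneId: String)(body: => A): A = {
  val previous = TimeZone.getDefault
  TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
  try body finally TimeZone.setDefault(previous)
}

// A formatter without an explicit zone inherits the JVM default at creation time.
def parseWithDefaultZone(s: String): Long = {
  new SimpleDateFormat("yyyyMMddHHmmssSSS").parse(s).getTime
}

val underBerlin = withDefaultZone("Europe/Berlin")(parseWithDefaultZone("20191009145901202"))
val underUtc = withDefaultZone("UTC")(parseWithDefaultZone("20191009145901202"))

println(underUtc - underBerlin) // 7200000
```

The same text yields instants two hours apart depending only on the JVM default, so aligning the driver and executor defaults with the session time zone removes the shift. Pinning the zone inside the UDF achieves the same without relying on launch flags.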