Time type difference in Scala and resetting the hour


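I have the following two columns:

import org.apache.spark.sql.functions.{col, unix_timestamp}
import org.apache.spark.sql.types.{TimestampType, ArrayType}

// Pattern letters are case-sensitive: yyyy = year, MM = month, dd = day of month, HH = hour, mm = minute, ss = second
statusWithOutDuplication.withColumn("requestTime", unix_timestamp(col("requestTime"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
statusWithOutDuplication.withColumn("responseTime", unix_timestamp(col("responseTime"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
I want to pass requestTime and responseTime into a UDF and find the difference between them after setting the minutes and seconds to 0.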


Python has "replace" (startDateTime.replace(second=0, minute=0)). What is the equivalent in Scala?

You can create a UDF as shown below: pass the values in as strings and convert them to Timestamp later, inside the UDF.
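import org.apache.spark.sql.functions.udf
import java.sql.Timestamp
val timeDiff = udf((start: String, end: String) => {
  // convert both strings to Timestamp and return the difference in seconds
  (Timestamp.valueOf(end).getTime - Timestamp.valueOf(start).getTime) / 1000
})
and use it as: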

df.withColumn("responseTime", timeDiff($"requestTime", $"responseTime"))
Instead of a UDF, you can use Spark's built-in functions, like this:

import org.apache.spark.sql.functions.{col, minute, second, unix_timestamp}

// withColumn returns a new DataFrame, so keep the result
val withUnixTimes = statusWithOutDuplication
  .withColumn("requestTime", unix_timestamp(col("requestTime"), "yyyy-MM-dd HH:mm:ss"))
  .withColumn("responseTime", unix_timestamp(col("responseTime"), "yyyy-MM-dd HH:mm:ss"))

//This resets minute and second to 0 (the column holds unix time in seconds)
def resetMinSec(colName: String) = {
  val ts = col(colName).cast("timestamp")
  col(colName) - minute(ts) * 60 - second(ts)
}

//create a new column with the difference between unixtimes (in seconds)
withUnixTimes.select((resetMinSec("responseTime") - resetMinSec("requestTime")).as("diff"))
Note that I did not cast requestTime/responseTime to Timestamp; you should do that cast after finding the difference.
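As an alternative sketch of the same idea, Spark 2.3 and later also ship a built-in date_trunc function that truncates a timestamp to the hour directly. This is a swapped-in technique, not the approach shown above. The local SparkSession, the sample row, and the column names requestHour, responseHour and diffSeconds are assumptions made only for illustration, as is the "yyyy-MM-dd HH:mm:ss" input format.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, date_trunc, to_timestamp, unix_timestamp}

// Local session for illustration only; in a real job the session already exists.
val spark = SparkSession.builder().master("local[*]").appName("hour-diff").getOrCreate()
import spark.implicits._

// Hypothetical sample data in the assumed string format.
val df = Seq(("2018-01-01 10:15:30", "2018-01-01 12:45:10"))
  .toDF("requestTime", "responseTime")

val fmt = "yyyy-MM-dd HH:mm:ss"

val result = df
  // Parse each string and truncate it to the start of its hour.
  .withColumn("requestHour", date_trunc("hour", to_timestamp(col("requestTime"), fmt)))
  .withColumn("responseHour", date_trunc("hour", to_timestamp(col("responseTime"), fmt)))
  // Difference between the truncated timestamps, in seconds.
  .withColumn("diffSeconds", unix_timestamp(col("responseHour")) - unix_timestamp(col("requestHour")))
```

Dividing diffSeconds by 3600 would give the difference in whole hours.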

The UDF approach should be similar, but would use plain Scala methods to get the minutes/seconds from the timestamp.
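A minimal sketch of what the body of such a UDF might look like in plain Scala, using java.time from the JDK. The object name HourDiff, its two helper methods, and the "yyyy-MM-dd HH:mm:ss" input format are assumptions for illustration; truncatedTo(ChronoUnit.HOURS) plays roughly the role of Python's replace(second=0, minute=0), except that it also clears the sub-second part.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object HourDiff {
  // Assumed input format; adjust it to match the real data.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Parse a string and reset minutes, seconds and nanoseconds to 0.
  def truncateToHour(s: String): LocalDateTime =
    LocalDateTime.parse(s, fmt).truncatedTo(ChronoUnit.HOURS)

  // Difference in whole hours between the two truncated values.
  def hourDiff(start: String, end: String): Long =
    Duration.between(truncateToHour(start), truncateToHour(end)).toHours
}
```

In Spark, a function like this could then be wrapped with udf(HourDiff.hourDiff _) and applied to two string columns. Any extra logic, such as building a list, would go inside the same function.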


Hope this helps a bit.

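I want to apply more logic and build a list from it, so I need a UDF. After converting to TimestampType, how do I find the difference inside the UDF? I also want to reset the minutes and seconds to 0.

What do you want to find in the UDF? The number of hours, days, weeks, or months?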