Difference between TimestampType values and resetting the hour in Scala
I have the following two columns:
import org.apache.spark.sql.functions.{col, unix_timestamp}
import org.apache.spark.sql.types.{TimestampType, ArrayType}
statusWithOutDuplication.withColumn("requestTime", unix_timestamp(col("requestTime"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
statusWithOutDuplication.withColumn("responseTime", unix_timestamp(col("responseTime"), "yyyy-MM-dd HH:mm:ss").cast("timestamp"))
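I want to pass requestTime and responseTime into a UDF and then find the difference between them, with the minutes and seconds set to "0".
Python has replace for this, e.g. startDateTime.replace(second=0, minute=0). What is the equivalent in Scala?
You can create a UDF as shown below: pass the values in as strings and convert them to Timestamp inside the UDF.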
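val timeDiff = udf((start: String, end: String) => {
  // convert to Timestamp (expects yyyy-MM-dd HH:mm:ss) and find the difference in seconds
  val startMs = java.sql.Timestamp.valueOf(start).getTime
  val endMs = java.sql.Timestamp.valueOf(end).getTime
  (endMs - startMs) / 1000
})
Then use it as: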
df.withColumn("responseTime", timeDiff($"requestTime", $"responseTime"))
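For the part of the question about resetting minutes and seconds, here is a minimal sketch of what such a UDF body could do in plain Scala. It assumes the strings look like 2017-05-01 10:45:30 (pattern yyyy-MM-dd HH:mm:ss). The names HourTruncation, toHour and truncatedDiffSeconds are hypothetical. truncatedTo(ChronoUnit.HOURS) plays roughly the role of Python's replace(second=0, minute=0), except that it also clears the sub-second part.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter
import java.time.temporal.ChronoUnit

object HourTruncation {
  // Assumed input pattern; adjust it if the columns use a different layout.
  private val fmt = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Parse a string and drop minutes, seconds and nanoseconds.
  def toHour(s: String): LocalDateTime =
    LocalDateTime.parse(s, fmt).truncatedTo(ChronoUnit.HOURS)

  // Difference in seconds between the two hour-truncated times.
  def truncatedDiffSeconds(start: String, end: String): Long =
    Duration.between(toHour(start), toHour(end)).getSeconds
}
```

In Spark this could be wrapped as udf(HourTruncation.truncatedDiffSeconds _) and applied to the two string columns. Because it works on LocalDateTime, it ignores time zones and daylight-saving shifts.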
Instead of a UDF, you can use built-in Spark functions, like this:
import org.apache.spark.sql.functions.{col, minute, second, unix_timestamp}

// Parse both columns to Unix time (seconds since the epoch)
val withUnixTimes = statusWithOutDuplication
  .withColumn("requestTime", unix_timestamp(col("requestTime"), "yyyy-MM-dd HH:mm:ss"))
  .withColumn("responseTime", unix_timestamp(col("responseTime"), "yyyy-MM-dd HH:mm:ss"))

// This resets minute and second to 0 by subtracting them from the Unix time
def resetMinSec(colName: String) = {
  col(colName) - minute(col(colName).cast("timestamp")) * 60 - second(col(colName).cast("timestamp"))
}

// Create a new column with the difference (in seconds) between the truncated Unix times
withUnixTimes.select((resetMinSec("responseTime") - resetMinSec("requestTime")).as("diff"))
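Note that I did not convert requestTime / responseTime to "Timestamp"; you should do that conversion after finding the difference.
The UDF approach should be similar, but it would use some Scala methods to get the minutes and seconds from the timestamp.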
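As a rough sketch of that UDF-side logic, assuming the columns reach the function as java.sql.Timestamp values: the object TimestampReset and its methods resetMinSec and hoursBetween are hypothetical names, and the hour count is computed from epoch milliseconds in the JVM default time zone, so a span that crosses a daylight-saving change may be off by one.

```scala
import java.sql.Timestamp

object TimestampReset {
  // Zero the minute, second and nanosecond fields of a Timestamp.
  def resetMinSec(ts: Timestamp): Timestamp =
    Timestamp.valueOf(ts.toLocalDateTime.withMinute(0).withSecond(0).withNano(0))

  // Whole hours between the two hour-truncated timestamps.
  def hoursBetween(start: Timestamp, end: Timestamp): Long =
    (resetMinSec(end).getTime - resetMinSec(start).getTime) / (60L * 60L * 1000L)
}
```

The individual fields are also available through ts.toLocalDateTime.getMinute and getSecond if more logic needs them.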
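Hope this helps a bit.
I want to apply more logic and build a list from the result, so I need a UDF. After converting to TimestampType, how do I find the difference inside the UDF? I want to reset the minutes and seconds to "0".
What do you want to compute in the UDF? Do you want the number of hours, days, weeks, or months?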