PySpark timestamp milliseconds
I am trying to get the difference between two timestamp columns, but the milliseconds are gone. How can I correct this?
from pyspark.sql import functions as F

timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
data = [
    (1, '2018-07-25 17:15:06.39', '2018-07-25 17:15:06.377'),
    (2, '2018-07-25 11:12:49.317', '2018-07-25 11:12:48.883')
]
df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts']) \
    .withColumn('diff',
                F.unix_timestamp('max_ts', format=timeFmt)
                - F.unix_timestamp('min_ts', format=timeFmt))
df.show(truncate=False)
This is the expected behaviour of unix_timestamp - the documentation clearly states that it only returns seconds, so the millisecond component is dropped when the calculation is done.
If you want that calculation, you can use the substring function to pick out the millisecond digits and then compute the difference. Note that this assumes fully formed data, i.e. that the milliseconds are always fully filled out (all 3 digits):
The answer from Tanjin doesn't work when the value is of type timestamp and the milliseconds are round numbers (like 390, 500). Python cuts the trailing 0, so the timestamp from the example ends up looking like 2018-07-25 17:15:06.39. The problem is the hardcoded value in F.substring('max_ts', 3, 3): if the 0 at the end is missing, the substring goes wild.
To convert a tmpColumn of type timestamp into a tmpLongColumn of type long, I used this snippet:
timeFmt = "yyyy-MM-dd HH:mm:ss.SSS"
df = df \
    .withColumn('tmpLongColumn', F.substring_index('tmpColumn', '.', -1).cast('float')) \
    .withColumn('tmpLongColumn', F.when(F.col('tmpLongColumn') < 100,
                                        F.col('tmpLongColumn') * 10)
                .otherwise(F.col('tmpLongColumn')).cast('long')) \
    .withColumn('tmpLongColumn', F.unix_timestamp('tmpColumn', format=timeFmt) * 1000
                + F.col('tmpLongColumn'))
The first transformation extracts the substring containing the milliseconds. Next, if the value is less than 100, it is multiplied by 10. Finally, the timestamp is converted and the milliseconds are added.

Unlike @kaichi, I did not find trailing zeros being truncated by the substring_index command, so multiplying the milliseconds by 10 is not necessary and can give the wrong answer: for example, if the milliseconds were originally 099, this would turn them into 990. Furthermore, you may also want to handle timestamps with zero milliseconds. To handle both of these situations, I have modified @kaichi's answer to give the difference between two timestamps in milliseconds:
df = (
    df
    .withColumn('tmpLongColumn', F.substring_index('tmpColumn', '.', -1).cast('long'))
    .withColumn(
        'tmpLongColumn',
        F.when(F.col('tmpLongColumn').isNull(), 0)
        .otherwise(F.col('tmpLongColumn')))
    .withColumn(
        'tmpColumn',
        F.unix_timestamp('tmpColumn', format=timeFmt) * 1000 + F.col('tmpLongColumn'))
    .drop('tmpLongColumn'))
Reason: pyspark's to_timestamp parses only up to seconds, while TimestampType is capable of holding milliseconds.

The following workaround may work: if the timestamp pattern contains S, invoke a UDF to get the string "INTERVAL MILLISECONDS" to use in the expression:
from pyspark.sql.functions import to_timestamp, expr

ts_pattern = "yyyy-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time down to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get the 256-millisecond interval from the data instead of hard-coding it, we can use a Java UDF:

df = df.withColumn(col_name, df[col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))

Inside the UDF, getIntervalStringUDF(String timeString, String pattern):

use SimpleDateFormat to parse the date according to the pattern
return the date formatted as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'"
return "INTERVAL 0 MILLISECONDS" on parse/format exceptions
Assuming you already have a dataframe with columns of timestamp type:
from datetime import datetime

data = [
    (1, datetime(2018, 7, 25, 17, 15, 6, 390000), datetime(2018, 7, 25, 17, 15, 6, 377000)),
    (2, datetime(2018, 7, 25, 11, 12, 49, 317000), datetime(2018, 7, 25, 11, 12, 48, 883000))
]
df = spark.createDataFrame(data, ['ID', 'max_ts', 'min_ts'])
df.printSchema()
# root
#  |-- ID: long (nullable = true)
#  |-- max_ts: timestamp (nullable = true)
#  |-- min_ts: timestamp (nullable = true)
You can get the time in seconds by casting the timestamp-type column to a double type, or in milliseconds by multiplying that result by 1000 (and optionally casting to long if you want an integer). For example:
df.select(
    F.col('max_ts').cast('double').alias('time_in_seconds'),
    (F.col('max_ts').cast('double') * 1000).cast('long').alias('time_in_milliseconds'),
).toPandas()
#    time_in_seconds  time_in_milliseconds
# 0   1532538906.390         1532538906390
# 1   1532517169.317         1532517169317
Finally, if you want the difference between the two times in milliseconds, you could do:

df.select(
    ((F.col('max_ts').cast('double') - F.col('min_ts').cast('double')) * 1000)
    .cast('long').alias('diff_in_milliseconds'),
).toPandas()
#    diff_in_milliseconds
# 0                    13
# 1                   434
I did this on PySpark 2.4.2, with no string concatenation required at all. When you cannot guarantee the exact format of the sub-seconds (length? trailing zeros?), I propose the following little algorithm, which works for all lengths and formats:

Algorithm
Based on the length of the sub-second string, calculate the appropriate divisor (10 to the power of the substring's length), divide to get the fractional seconds, and add them to the whole-second unix_timestamp.
Removing the superfluous columns afterwards should not be a problem.
Demonstration
My sample results look like this:
+----------------------+----------------+----------------+-------+----------+----------------+
|time |subsecond_string|subsecond_length|divisor|subseconds|timestamp_subsec|
+----------------------+----------------+----------------+-------+----------+----------------+
|2019-04-02 14:34:16.02|02 |2 |100.0 |0.02 |1.55420845602E9 |
|2019-04-02 14:34:16.03|03 |2 |100.0 |0.03 |1.55420845603E9 |
|2019-04-02 14:34:16.04|04 |2 |100.0 |0.04 |1.55420845604E9 |
|2019-04-02 14:34:16.05|05 |2 |100.0 |0.05 |1.55420845605E9 |
|2019-04-02 14:34:16.06|06 |2 |100.0 |0.06 |1.55420845606E9 |
|2019-04-02 14:34:16.07|07 |2 |100.0 |0.07 |1.55420845607E9 |
|2019-04-02 14:34:16.08|08 |2 |100.0 |0.08 |1.55420845608E9 |
|2019-04-02 14:34:16.09|09 |2 |100.0 |0.09 |1.55420845609E9 |
|2019-04-02 14:34:16.1 |1 |1 |10.0 |0.1 |1.5542084561E9 |
|2019-04-02 14:34:16.11|11 |2 |100.0 |0.11 |1.55420845611E9 |
|2019-04-02 14:34:16.12|12 |2 |100.0 |0.12 |1.55420845612E9 |
|2019-04-02 14:34:16.13|13 |2 |100.0 |0.13 |1.55420845613E9 |
|2019-04-02 14:34:16.14|14 |2 |100.0 |0.14 |1.55420845614E9 |
|2019-04-02 14:34:16.15|15 |2 |100.0 |0.15 |1.55420845615E9 |
|2019-04-02 14:34:16.16|16 |2 |100.0 |0.16 |1.55420845616E9 |
|2019-04-02 14:34:16.17|17 |2 |100.0 |0.17 |1.55420845617E9 |
|2019-04-02 14:34:16.18|18 |2 |100.0 |0.18 |1.55420845618E9 |
|2019-04-02 14:34:16.19|19 |2 |100.0 |0.19 |1.55420845619E9 |
|2019-04-02 14:34:16.2 |2 |1 |10.0 |0.2 |1.5542084562E9 |
|2019-04-02 14:34:16.21|21 |2 |100.0 |0.21 |1.55420845621E9 |
+----------------------+----------------+----------------+-------+----------+----------------+
Please check whether this proposed solution works for you:
timeFmt = "yyyy-MM-dd' 'HH:mm:ss.SSS"
current_col = "time"
df = df.withColumn("subsecond_string", F.substring_index(current_col, '.', -1))
df = df.withColumn("subsecond_length", F.length(F.col("subsecond_string")))
df = df.withColumn("divisor", F.pow(10,"subsecond_length"))
df = df.withColumn("subseconds", F.col("subsecond_string").cast("int") / F.col("divisor") )
# Putting it all together
df = df.withColumn("timestamp_subsec", F.unix_timestamp(current_col, format=timeFmt) + F.col("subseconds"))