Python: get milliseconds as a double from a column value formatted as "hours:minutes:seconds.milliseconds"

I have a DataFrame like this:

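from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

test_df1 = spark.createDataFrame(
    [
        (1, "-", "-"),
        (1, "97", "00:00:00.02"),
        (1, "78", "00:00:00.02"),
        (2, "83", "00:00:00.02"),
        (2, "14", "00:00:00.02"),
        (2, "115", "00:00:00.02"),
    ],
    ['ID', 'random', 'time']
)
test_df1.show()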

+---+------+-----------+
| ID|random|       time|
+---+------+-----------+
|  1|     -|          -|
|  1|    97|00:00:00.02|
|  1|    78|00:00:00.02|
|  2|    83|00:00:00.02|
|  2|    14|00:00:00.02|
|  2|   115|00:00:00.02|
+---+------+-----------+
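
How can I convert the time column to milliseconds as a DoubleType? Currently I do what is shown below: I take the digits after the seconds as a string and cast them to double. Is there a better way?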

test_df2 = test_df1.withColumn("time", F.substring_index("time", '.', -1).cast("double"))
test_df2.show()

+---+------+----+
| ID|random|time|
+---+------+----+
|  1|     -|null|
|  1|    97| 2.0|
|  1|    78| 2.0|
|  2|    83| 2.0|
|  2|    14| 2.0|
|  2|   115| 2.0|
+---+------+----+
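
One thing worth noting about the substring approach: it reads the digits after the dot as a whole number, so 00:00:00.02 becomes 2.0, although .02 seconds is 20 milliseconds, and the hours, minutes and seconds are dropped. The plain-Python sketch below only illustrates the arithmetic; the helper name hms_to_millis is made up for this example, and it assumes every valid value has exactly three colon-separated fields.

```python
from typing import Optional


def hms_to_millis(text: str) -> Optional[float]:
    """Convert 'HH:MM:SS.fff' to total milliseconds, or None if malformed."""
    parts = text.split(":")
    if len(parts) != 3:
        return None  # e.g. the "-" placeholder rows
    try:
        hours, minutes, seconds = (float(p) for p in parts)
    except ValueError:
        return None
    total_ms = (hours * 3600 + minutes * 60 + seconds) * 1000
    # Round to 3 decimals to hide binary floating-point noise.
    return round(total_ms, 3)


# Digits after the dot, read as a whole number (the substring approach):
print(float("00:00:00.02".rsplit(".", 1)[-1]))  # 2.0
# Full conversion of the same value:
print(hms_to_millis("00:00:00.02"))  # 20.0
```

The same arithmetic can be expressed with Spark column functions, so no Python UDF is needed on the cluster.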

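What I ended up doing was converting the time column to a timestamp and then to Unix time, and then subtracting the Unix time of today's midnight from it. That gives me the time in seconds, which I can use to get ms, ns or anything else.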

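import datetime
from time import mktime

# Note: this import shadows Python's built-in round() with Spark's version.
from pyspark.sql.functions import col, round, to_timestamp

# Unix time (in seconds) of today's midnight in the local time zone
today = datetime.date.today()
unixtime = mktime(today.timetuple())

test_df1 = test_df1.withColumn('time_to_timestamp', to_timestamp('time')) \
                   .withColumn("unix_time_w_ms", col("time_to_timestamp").cast("double")) \
                   .withColumn("time_in_s", col("unix_time_w_ms") - unixtime) \
                   .withColumn("time_in_s", round(col('time_in_s'), 3))

test_df1.show()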

+---+------+-----------+--------------------+---------------+---------+
| ID|random|       time|   time_to_timestamp| unix_time_w_ms|time_in_s|
+---+------+-----------+--------------------+---------------+---------+
|  1|     -|          -|                null|           null|     null|
|  1|    97|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  1|    78|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|    83|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|    14|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|   115|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
+---+------+-----------+--------------------+---------------+---------+
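
I still have a feeling this could be done better without so many withColumn calls. I will have to run this over huge DataFrames, and I have read that heavy withColumn usage is not preferred.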
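
On the withColumn concern: each withColumn call adds a projection to the query plan, and the PySpark documentation suggests a single select when many columns are added. A minimal sketch of that idea follows. The names time_to_millis, TIME_PATTERN and time_ms are made up for this example, the pattern assumes two-digit HH:MM:SS fields with an optional fraction, and the code has not been run against the original data.

```python
MS_PER_HOUR = 3_600_000
MS_PER_MINUTE = 60_000
MS_PER_SECOND = 1_000

# Hypothetical pattern for the expected layout; adjust it to the real data.
TIME_PATTERN = r"^\d{2}:\d{2}:\d{2}(\.\d+)?$"


def time_to_millis(column_name):
    """Return a Spark Column holding 'HH:MM:SS.fff' as milliseconds (double)."""
    # Imported lazily so this module can be loaded without PySpark installed.
    from pyspark.sql import functions as F

    raw = F.col(column_name)
    parts = F.split(raw, ":")
    millis = (
        parts.getItem(0).cast("double") * MS_PER_HOUR
        + parts.getItem(1).cast("double") * MS_PER_MINUTE
        + parts.getItem(2).cast("double") * MS_PER_SECOND
    )
    # Rows that do not match the pattern (such as "-") fall through to null.
    return F.when(raw.rlike(TIME_PATTERN), F.round(millis, 3))


# Usage, assuming test_df1 from the question is in scope:
# test_df1.select("*", time_to_millis("time").alias("time_ms")).show()
```

Because the whole conversion is one column expression, it needs neither today's date nor an intermediate timestamp column, and several such expressions can be passed to the same select.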

This should help you: