Python: get milliseconds as a double from a column whose values are formatted as "hours:minutes:seconds.milliseconds"
Tags: python, apache-spark, pyspark
I have a DataFrame like this:
test_df1 = spark.createDataFrame(
    [
        (1, "-", "-"),
        (1, "97", "00:00:00.02"),
        (1, "78", "00:00:00.02"),
        (2, "83", "00:00:00.02"),
        (2, "14", "00:00:00.02"),
        (2, "115", "00:00:00.02"),
    ],
    ['ID', 'random', 'time']
)
test_df1.show()
+---+------+-----------+
| ID|random|       time|
+---+------+-----------+
|  1|     -|          -|
|  1|    97|00:00:00.02|
|  1|    78|00:00:00.02|
|  2|    83|00:00:00.02|
|  2|    14|00:00:00.02|
|  2|   115|00:00:00.02|
+---+------+-----------+
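How can I convert the `time` column to milliseconds as a DoubleType? Currently I do what is shown below: I take the digits after the seconds as a string and then cast them to double. Is there a better way?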
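from pyspark.sql import functions as F

# Keep only the text after the last '.' and cast it to double
test_df2 = test_df1.withColumn("time", F.substring_index("time", '.', -1).cast("double"))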
test_df2.show()
+---+------+----+
| ID|random|time|
+---+------+----+
|  1|  null|null|
|  1|  97.0| 2.0|
|  1|  78.0| 2.0|
|  2|  83.0| 2.0|
|  2|  14.0| 2.0|
|  2| 115.0| 2.0|
+---+------+----+
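One thing worth checking in the output above: the digits after the dot are a decimal fraction of a second, so "02" means 0.02 s (20 ms), not 2 ms. The plain-Python sketch below illustrates that arithmetic. The function name `fraction_digits_to_millis` is hypothetical, and it only covers the sub-second part, ignoring hours, minutes and seconds.

```python
def fraction_digits_to_millis(time_string):
    """Return the sub-second part of 'HH:MM:SS.fff' in milliseconds.

    Hypothetical helper for illustration only; returns None when the
    string has no '.' or the part after it is not made of digits.
    """
    _, dot, fraction = time_string.rpartition(".")
    if not dot or not fraction.isdecimal():
        return None
    # "02" is 2 hundredths of a second, so scale by the number of digits
    return int(fraction) * 1000 / 10 ** len(fraction)


print(fraction_digits_to_millis("00:00:00.02"))  # 0.02 s expressed in ms
print(fraction_digits_to_millis("-"))            # placeholder rows give None
```

Casting the raw digits to double, as in the snippet above, only matches this result when the fraction always has exactly three digits.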
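What I ended up doing was converting the `time` column to a timestamp and then to Unix time, after which I subtracted today's midnight timestamp, also as Unix time. This gives me the time in seconds, which I can use to get ms, ns, or whatever else I need.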
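import datetime
from time import mktime

# Note: this import shadows Python's built-in round
from pyspark.sql.functions import col, round, to_timestamp

# Unix time (in seconds) of today's local midnight
today = datetime.date.today()
unixtime = mktime(today.timetuple())

# Parse the time as a timestamp on today's date, then subtract midnight
test_df1 = test_df1.withColumn('time_to_timestamp', to_timestamp('time')) \
    .withColumn("unix_time_w_ms", col("time_to_timestamp").cast("double")) \
    .withColumn("time_in_s", col("unix_time_w_ms") - unixtime) \
    .withColumn("time_in_s", round(col('time_in_s'), 3))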
test_df1.show()
+---+------+-----------+--------------------+---------------+---------+
| ID|random|       time|   time_to_timestamp| unix_time_w_ms|time_in_s|
+---+------+-----------+--------------------+---------------+---------+
|  1|     -|          -|                null|           null|     null|
|  1|    97|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  1|    78|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|    83|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|    14|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
|  2|   115|00:00:00.02|2020-11-20 00:00:...|1.60583040002E9|     0.02|
+---+------+-----------+--------------------+---------------+---------+
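I still have a feeling this could be done better without so many `withColumn` calls, since I will have to loop this over huge DataFrames and I have read that heavy `withColumn` usage is not preferred.
This should help you: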
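One possible approach, offered as a sketch rather than a definitive answer, is to split the string on ":" and do the arithmetic in a single column expression, which avoids both the dependency on today's date and the chain of `withColumn` calls. The names `TIME_PATTERN`, `time_to_millis` and `time_to_millis_column` are hypothetical. The pure-Python function mirrors the same arithmetic so it can be checked without a Spark session. The sketch assumes the column holds strings shaped like `HH:MM:SS` with an optional fraction; anything else (such as "-") becomes null.

```python
import re

# Assumed input shape: hours:minutes:seconds with an optional fraction
TIME_PATTERN = r"^(\d+):(\d{2}):(\d{2}(?:\.\d+)?)$"
_TIME_RE = re.compile(TIME_PATTERN)


def time_to_millis(value):
    """Plain-Python reference: 'HH:MM:SS.fff' -> milliseconds as float."""
    if value is None:
        return None
    match = _TIME_RE.match(value)
    if match is None:
        return None  # e.g. the "-" placeholder rows
    hours, minutes, seconds = match.groups()
    total = int(hours) * 3_600_000 + int(minutes) * 60_000 + float(seconds) * 1_000
    return round(total, 3)


def time_to_millis_column(col_name="time"):
    """Build the equivalent Spark column expression (null for bad input)."""
    # Imported lazily so the reference function above works without Spark
    from pyspark.sql import functions as F

    value = F.col(col_name)
    parts = F.split(value, ":")
    millis = (
        parts.getItem(0).cast("double") * 3_600_000
        + parts.getItem(1).cast("double") * 60_000
        + parts.getItem(2).cast("double") * 1_000
    )
    # when() without otherwise() yields null for rows that do not match
    return F.when(value.rlike(TIME_PATTERN), F.round(millis, 3))
```

With the DataFrame from the question, this could be used in one `select`, for example `test_df1.select("ID", "random", time_to_millis_column("time").alias("time_ms"))`. Putting all derived columns into a single `select` is the usual way to avoid a long chain of `withColumn` calls.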