Python PySpark: Extracting the hour and minutes from a string
I'm looking for help with extracting the hour and the minutes separately from a string in PySpark:
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([['1325'], ['1433'], ['730']], ['time'])
df = df.withColumn("time", to_timestamp("time"))  # cast to timestamp
display(df)  # Databricks notebook helper
# example timestamp results
time
1 1325-01-01T00:00:00.000-0500
2 1433-01-01T00:00:00.000-0500
3 null
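I'm not sure how to go about this. Converting to unixtime, date, or timestamp does not work well with this kind of string data: the values are read as years, and 730 becomes null.
Ideally, I'd like it to return: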
time hour minutes
1 1325 13 25
2 1433 14 33
3 730 7 30
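IIUC, one approach you could try is to left-pad the string to four digits, split it with the pattern
(?=\d\d$)
which matches the position just before the last two digits, and then take the hour and minutes from the resulting array: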
from pyspark.sql import functions as F

# Left-pad to 4 digits, then split just before the last two digits.
df.withColumn('hm', F.split(F.lpad('time', 4, '0'), r'(?=\d\d$)')) \
    .selectExpr('time', 'int(hm[0]) as hour', 'int(hm[1]) as minutes') \
    .show()
+----+----+-------+
|time|hour|minutes|
+----+----+-------+
|1325|  13|     25|
|1433|  14|     33|
| 730|   7|     30|
|   2|   0|      2|
+----+----+-------+
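The pad-then-split logic can also be checked locally in plain Python, without a Spark session. This is only a minimal sketch, assuming Python 3.7+ (where re.split splits on zero-width matches) and inputs of at most four digits; split_hhmm is a hypothetical helper name, not part of PySpark:

```python
import re

def split_hhmm(value):
    # Hypothetical helper mirroring lpad(time, 4, '0') followed by a split
    # on the zero-width position before the last two digits.
    # Note: unlike Spark's lpad, rjust does not truncate longer inputs.
    padded = str(value).rjust(4, "0")
    hour, minutes = re.split(r"(?=\d\d$)", padded)
    return int(hour), int(minutes)

for raw in ["1325", "1433", "730", "2"]:
    print(raw, split_hhmm(raw))
```

The padding step is what makes short values such as 730 or 2 split into exactly two parts.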
Awesome. I wonder whether you could handle an edge case like this:
df = spark.createDataFrame([['1325'], ['1433'], ['730'], ['2']], ['time'])
where 2 would be 2 minutes past midnight.
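For edge cases like this, a different technique that may be worth considering is plain integer arithmetic instead of a regex split: read the value as an HHMM number, then divide by 100 for the hour and take the remainder for the minutes. The sketch below is plain Python and hour_minute is a hypothetical helper; the range check is an added assumption about what counts as valid input, not something stated in the thread. In PySpark the same idea could presumably be expressed by casting the column to an integer and using integer division and modulo.

```python
def hour_minute(value):
    # Hypothetical helper: treat the value as an HHMM integer.
    number = int(value)
    hour, minutes = divmod(number, 100)
    # Assumption: anything outside 00:00-23:59 is rejected as invalid.
    if not (0 <= hour <= 23 and 0 <= minutes <= 59):
        return None
    return hour, minutes

for raw in ["1325", "1433", "730", "2", "2575"]:
    print(raw, hour_minute(raw))
```

With this approach no padding is needed, since 2 divides into hour 0 and minutes 2 directly.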