Python PySpark: Extracting hours and minutes from a string


I'm looking for help with extracting the hour and the minutes separately from a string in PySpark:

from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame([['1325'], ['1433'], ['730']], ['time'])
df = df.withColumn("time", to_timestamp("time"))  # cast to timestamp
display(df)

# example timestamp results
  time
1 1325-01-01T00:00:00.000-0500
2 1433-01-01T00:00:00.000-0500
3 null
I'm not sure how to approach this; converting to unixtime, date, or timestamp doesn't work well with this kind of string data.

Ideally, I'd like it to return:

  time  hour  minutes
1 1325   13     25
2 1433   14     33
3 730    7      30

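IIUC, one approach you can try is to left-pad the string to four characters and split it with the pattern
(?=\d\d$)
which is a zero-width lookahead that matches just before the last two digits, and then extract the hour/minutes from the resulting array: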

from pyspark.sql import functions as F

df.withColumn('hm', F.split(F.lpad('time', 4, '0'), r'(?=\d\d$)')) \
    .selectExpr('time', 'int(hm[0]) as hour', 'int(hm[1]) as minutes') \
    .show()
+----+----+-------+
|time|hour|minutes|
+----+----+-------+
|1325|  13|     25|
|1433|  14|     33|
| 730|   7|     30|
|   2|   0|      2|
+----+----+-------+
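
To see what the pad-then-split step does to a single value, here is a minimal pure-Python sketch of the same idea, runnable without a Spark session. The helper name `split_hhmm` is hypothetical, `str.rjust` stands in for `F.lpad`, and `re.split` stands in for `F.split`. It assumes Python 3.7 or later, where `re.split` splits on zero-width matches.

```python
import re

def split_hhmm(value):
    """Hypothetical helper: split an 'hmm'/'hhmm' value into (hour, minute)."""
    # Left-pad to four characters, playing the role of F.lpad('time', 4, '0').
    padded = str(value).rjust(4, "0")
    # Zero-width lookahead: split just before the last two digits.
    hour, minute = re.split(r"(?=\d\d$)", padded)
    return int(hour), int(minute)

for raw in ["1325", "1433", "730", "2"]:
    print(raw, split_hhmm(raw))
```

Because the pattern consumes no characters, both halves of the padded string survive the split, which is why the padding alone is enough to cover short inputs such as 730 or 2.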

Great! I was wondering whether you could handle an edge case like this:
df = spark.createDataFrame([['1325'], ['1433'], ['730'], ['2']], ['time'])
where
2
would be 2 minutes past midnight.
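
As a cross-check for this edge case, the same numbers can be obtained with integer arithmetic instead of a regex split. This is a different technique from the one in the answer, sketched here in plain Python. The helper name `hhmm_to_parts` is hypothetical, and it assumes every value is a non-negative integer written as hour digits followed by two minute digits.

```python
def hhmm_to_parts(value):
    """Hypothetical helper: derive (hour, minute) from an hmm/hhmm number."""
    # The last two decimal digits are the minutes; whatever is left is the hour.
    hour, minute = divmod(int(value), 100)
    return hour, minute

for raw in ["1325", "1433", "730", "2"]:
    print(raw, hhmm_to_parts(raw))
```

With this rule, 2 maps to hour 0 and minute 2, which matches the "2 minutes past midnight" reading.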