How do I generate hourly timestamps between two dates in PySpark?


Consider this sample DataFrame:

import datetime as dt

data = [(dt.datetime(2000,1,1,15,20,37), dt.datetime(2000,1,1,19,12,22))]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
+-------------------+-------------------+
|            minDate|            maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
+-------------------+-------------------+
I would like to break the span between these two dates into an hourly time series, like this:

+-------------------+-------------------+
|            minDate|            maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 16:00:00|
|2000-01-01 16:01:00|2000-01-01 17:00:00|
|2000-01-01 17:01:00|2000-01-01 18:00:00|
|2000-01-01 18:01:00|2000-01-01 19:00:00|
|2000-01-01 19:01:00|2000-01-01 19:12:22|
+-------------------+-------------------+
Do you have any suggestions for achieving this without using a UDF?

Thanks

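This is how I ended up solving the problem.

Input data

import datetime as dt

data = [
    (dt.datetime(2000, 1, 1, 15, 20, 37), dt.datetime(2000, 1, 1, 19, 12, 22)),
    (dt.datetime(2001, 1, 1, 15, 20, 37), dt.datetime(2001, 1, 1, 18, 12, 22)),
]
df = spark.createDataFrame(data, ["minDate", "maxDate"])
df.show()
which results in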

+-------------------+-------------------+
|            minDate|            maxDate|
+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|
|2001-01-01 15:20:37|2001-01-01 18:12:22|
+-------------------+-------------------+
Transforming the data

from pyspark.sql import functions as fn

# Number of hours between minDate and maxDate, rounded up
df = df.withColumn(
    "hour_diff",
    fn.ceil((fn.col("maxDate").cast("long") - fn.col("minDate").cast("long")) / 3600)
)

# Duplicate each row hour_diff + 1 times; idx numbers the copies from 0
df = df.withColumn("repeat", fn.expr("split(repeat(',', hour_diff), ',')"))\
    .select("*", fn.posexplode("repeat").alias("idx", "val"))\
    .drop("repeat", "val")\
    .withColumn("hour_add", (fn.col("minDate").cast("long") + fn.col("idx") * 3600).cast("timestamp"))

# Create the new start and end dates according to the hour boundaries
df = (df
    .withColumn(
        "start_dt",
        fn.when(
            fn.col("idx") > 0,
            (fn.floor(fn.col("hour_add").cast("long") / 3600) * 3600).cast("timestamp")
        ).otherwise(fn.col("minDate"))
    ).withColumn(
        "end_dt",
        fn.when(
            fn.col("idx") != fn.col("hour_diff"),
            (fn.ceil(fn.col("hour_add").cast("long") / 3600) * 3600 - 60).cast("timestamp")
        ).otherwise(fn.col("maxDate"))
    ).drop("hour_diff", "idx", "hour_add"))
df.show()
which results in

+-------------------+-------------------+-------------------+-------------------+
|            minDate|            maxDate|           start_dt|             end_dt|
+-------------------+-------------------+-------------------+-------------------+
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 15:20:37|2000-01-01 15:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 16:00:00|2000-01-01 16:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 17:00:00|2000-01-01 17:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 18:00:00|2000-01-01 18:59:00|
|2000-01-01 15:20:37|2000-01-01 19:12:22|2000-01-01 19:00:00|2000-01-01 19:12:22|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 15:20:37|2001-01-01 15:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 16:00:00|2001-01-01 16:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 17:00:00|2001-01-01 17:59:00|
|2001-01-01 15:20:37|2001-01-01 18:12:22|2001-01-01 18:00:00|2001-01-01 18:12:22|
+-------------------+-------------------+-------------------+-------------------+
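
The epoch-second arithmetic behind start_dt and end_dt can be checked outside Spark. The following is a minimal plain-Python sketch of the same floor/ceil logic, not the Spark code itself. The function name split_into_hours is hypothetical, and it assumes naive datetimes whose epoch offset falls on whole hours, whereas Spark's casts depend on the session time zone.

```python
import datetime as dt

EPOCH = dt.datetime(1970, 1, 1)


def split_into_hours(min_date, max_date):
    """Mirror the start_dt / end_dt arithmetic with integer epoch seconds."""
    lo = int((min_date - EPOCH).total_seconds())
    hi = int((max_date - EPOCH).total_seconds())
    hour_diff = -(-(hi - lo) // 3600)  # ceil((hi - lo) / 3600)
    rows = []
    for idx in range(hour_diff + 1):  # one copy per idx, as posexplode yields
        hour_add = lo + idx * 3600
        # start: minDate for the first copy, otherwise hour_add floored to the hour
        start = lo if idx == 0 else (hour_add // 3600) * 3600
        # end: maxDate for the last copy, otherwise the next hour boundary minus 60 s
        end = hi if idx == hour_diff else -(-hour_add // 3600) * 3600 - 60
        rows.append((EPOCH + dt.timedelta(seconds=start),
                     EPOCH + dt.timedelta(seconds=end)))
    return rows


for start, end in split_into_hours(dt.datetime(2000, 1, 1, 15, 20, 37),
                                   dt.datetime(2000, 1, 1, 19, 12, 22)):
    print(start, end)
```

For the first input row this gives the same five start/end pairs as the table above. Trying other inputs points to a caveat worth checking against your own data: with minDate 15:10:00 and maxDate 17:50:00 the arithmetic produces a final interval that starts at 18:00:00, which is after maxDate.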

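You could probably generate the timestamps automatically and then convert the integers to dates using the Unix-time format. (I haven't tried it, and I haven't checked which solution is best.)

Thanks for your comment. I'll give it a try eventually. If you happen to do it before I do, feel free to post a new answer!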