Converting a timestamp to a specific date in PySpark


I want to convert the timestamps in a specific column to a specific date format.

Here is my input:

+----------+
| timestamp|
+----------+
|1532383202|
+----------+

What I expect is:

+------------------+
|date              |
+------------------+
|24/7/2018 1:00:00 |
+------------------+

If possible, I would also like to set the minutes and seconds to 0, even when they are not 0.

For example, if I have:

+------------------+
|date              |
+------------------+
|24/7/2018 1:06:32 |
+------------------+

I want this:

+------------------+
|date              |
+------------------+
|24/7/2018 1:00:00 |
+------------------+

What I tried:

from pyspark.sql.functions import unix_timestamp
table = table.withColumn(
    'timestamp',
    unix_timestamp(date_format('timestamp', 'yyyy-MM-dd HH:MM:SS'))
)

But I get null values.

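Maybe you can use the datetime library to convert the timestamp to the format you want. You should also use a user-defined function (UDF) to work on Spark DataFrame columns. Here is what I would do:

# Import the libraries
from pyspark.sql.functions import udf
from datetime import datetime

# Create a function that returns the desired string from a timestamp
def format_timestamp(ts):
    return datetime.fromtimestamp(ts).strftime('%Y-%m-%d %H:00:00')

# Create the UDF
format_timestamp_udf = udf(lambda x: format_timestamp(x))

# Finally, apply the function to each element of the 'timestamp' column
table = table.withColumn('timestamp', format_timestamp_udf(table['timestamp']))

Hope this helps.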
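One caveat with the UDF above: `datetime.fromtimestamp(ts)` without a `tz` argument uses the local time zone of whichever machine runs it. As a minimal plain-Python sketch (outside Spark), the helper below pins the zone explicitly. The name `format_hour` and the UTC+3 offset are assumptions, chosen only because they reproduce the expected output in the question.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the asker's zone is UTC+3, inferred from the expected "24/7/2018 1:00:00".
ASSUMED_TZ = timezone(timedelta(hours=3))

def format_hour(ts, tz=ASSUMED_TZ):
    """Format epoch seconds as dd/mm/yyyy HH:00:00 in the given time zone."""
    # The literal "00:00" in the format string zeroes the minutes and seconds.
    return datetime.fromtimestamp(ts, tz=tz).strftime("%d/%m/%Y %H:00:00")

print(format_hour(1532383202))                   # assumed UTC+3
print(format_hour(1532383202, tz=timezone.utc))  # same instant, rendered in UTC
```

The same function body could be wrapped in `udf(...)` as in the answer; passing the zone explicitly just makes the result independent of the worker's settings.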


Update
Inspired by @Tony Pellerin's answer, I realized that you can go directly to :00:00 without using regexp_replace():

table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:00:00"))
table.show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+
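Another way to reach the same result is to truncate the number itself before formatting it, rather than zeroing fields in the format string. A minimal plain-Python sketch, assuming the target time zone's offset is a whole number of hours (the helper name `truncate_to_hour` is hypothetical):

```python
SECONDS_PER_HOUR = 3600

def truncate_to_hour(ts):
    """Round epoch seconds down to the start of the hour."""
    # Subtracting the remainder drops the minutes and seconds.
    return ts - ts % SECONDS_PER_HOUR

print(truncate_to_hour(1532383202))
```

Because the truncated value is still an integer timestamp, it can be formatted afterwards with any pattern, including one that prints minutes and seconds.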

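Your code does not work because pyspark.sql.functions.unix_timestamp() will:
Convert time string with given pattern ('yyyy-MM-dd HH:mm:ss', by default) to Unix time stamp (in seconds), using the default timezone and the default locale, return null if fail.
What you actually need is the inverse of this operation, that is, converting an integer timestamp to a string. For this you can use pyspark.sql.functions.from_unixtime(). Note that in the pattern, minutes and seconds are lowercase mm and ss; uppercase MM is the month and SS is a sub-second field:

import pyspark.sql.functions as f

table = table.withColumn("date", f.from_unixtime("timestamp", "dd/MM/yyyy HH:mm:ss"))
table.show()
# (the hour shown depends on the session time zone)
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:02|
#+----------+-------------------+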
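To make the two directions concrete, here is a small plain-Python analogy (not Spark code). The helpers `to_unix` and `from_unix` are hypothetical names that only loosely mirror unix_timestamp() and from_unixtime(); both work in UTC, and returning `None` stands in for Spark's null on a failed parse.

```python
import calendar
import time

FMT = "%Y-%m-%d %H:%M:%S"  # strftime spelling of yyyy-MM-dd HH:mm:ss

def to_unix(text, fmt=FMT):
    """String -> epoch seconds (UTC); None if the text does not match fmt."""
    try:
        return calendar.timegm(time.strptime(text, fmt))
    except ValueError:
        return None

def from_unix(ts, fmt=FMT):
    """Epoch seconds -> string (UTC): the inverse direction."""
    return time.strftime(fmt, time.gmtime(ts))

print(to_unix("2018-07-23 22:00:02"))  # a string that matches the pattern
print(to_unix("1532383202"))           # not a date string, so parsing fails
print(from_unix(1532383202))           # the direction the question needs
```

Feeding something that is not a date string in the given pattern to the string-to-number direction fails, which is consistent with the null values reported in the question.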
Now the date column is a string:

table.printSchema()
#root
# |-- timestamp: long (nullable = true)
# |-- date: string (nullable = true)
So you can use pyspark.sql.functions.regexp_replace() to set the minutes and seconds to zero:

table.withColumn("date", f.regexp_replace("date", r":\d{2}:\d{2}", ":00:00")).show()
#+----------+-------------------+
#| timestamp|               date|
#+----------+-------------------+
#|1532383202|23/07/2018 18:00:00|
#+----------+-------------------+

The regex pattern ":\d{2}" means: match a literal : followed by exactly two digits.
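The same substitution can be tried outside Spark with Python's `re` module, which is a quick way to check the pattern before running it on a DataFrame. This is only an illustrative sketch; the function name `zero_minutes_seconds` is made up for the example.

```python
import re

def zero_minutes_seconds(date_str):
    """Replace the ':MM:SS' part of a date string with ':00:00'."""
    # ":\d{2}" matches a colon followed by exactly two digits; twice in a row
    # it covers the minutes and the seconds, leaving the hour untouched.
    return re.sub(r":\d{2}:\d{2}", ":00:00", date_str)

print(zero_minutes_seconds("23/07/2018 18:06:32"))
```

Note that the replacement string starts with a colon; without it the separator between the hour and the minutes would be lost.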