Partitioning on a timestamp column in an Apache Spark DataFrame


apache-spark dataframe pyspark


I have a DataFrame in PySpark in the following format:

Date        Id  Name    Hours   Dno Dname
12/11/2013  1   sam     8       102 It
12/10/2013  2   Ram     7       102 It
11/10/2013  3   Jack    8       103 Accounts
12/11/2013  4   Jim     9       101 Marketing

I want to partition it on Dno and save it as a table in Hive using the Parquet format:

df.write.saveAsTable('default.testing', mode='overwrite', partitionBy='Dno', format='parquet')

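The query works fine and creates a table in Hive with Parquet input.

Now I want to partition on the year and month of the date column. The timestamp is a Unix timestamp.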

How can we achieve this in PySpark? I have done it in Hive but could not get it to work in PySpark.

Spark >= 3.1

Instead of cast, use timestamp_seconds:

from pyspark.sql.functions import col, timestamp_seconds, year

year(timestamp_seconds(col("timestamp")))

Spark < 3.1

Just extract the fields you want to use and pass the list of columns to the writer's partitionBy. If timestamp is a UNIX timestamp expressed in seconds:

df = sc.parallelize([
    (1484810378, 1, "sam", 8, 102, "It"),
    (1484815300, 2, "ram", 7, 103, "Accounts")
]).toDF(["timestamp", "id", "name", "hours", "dno", "dname"])

Add the columns:

from pyspark.sql.functions import year, month, col

df_with_year_and_month = (df
    .withColumn("year", year(col("timestamp").cast("timestamp")))
    .withColumn("month", month(col("timestamp").cast("timestamp"))))
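To sanity-check which year and month values the two sample rows should end up with, here is a minimal plain-Python sketch. It does not use Spark, and it assumes the timestamps are interpreted in UTC; Spark uses the session time zone, so values near a month boundary could differ in another zone. The helper name year_and_month is illustrative only.

```python
from datetime import datetime, timezone

# Unix timestamps (seconds) taken from the two sample rows above.
sample_timestamps = [1484810378, 1484815300]


def year_and_month(unix_seconds):
    """Return (year, month) for a Unix timestamp in seconds, interpreted in UTC."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    return moment.year, moment.month


partition_values = [year_and_month(ts) for ts in sample_timestamps]
print(partition_values)  # [(2017, 1), (2017, 1)]
```

Both rows fall in January 2017, so they would land in the same year/month partition.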

And write:

(df_with_year_and_month
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))
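Partitioning by year and month produces a Hive-style directory layout under the table location, with one directory level per partition column named column=value. The plain-Python sketch below only illustrates that naming for a given Unix timestamp. The helper partition_subdirectory is hypothetical, not a Spark API, and it assumes UTC.

```python
from datetime import datetime, timezone


def partition_subdirectory(unix_seconds):
    """Hypothetical helper: relative Hive-style partition directory for one row."""
    moment = datetime.fromtimestamp(unix_seconds, tz=timezone.utc)
    # One directory level per partition column, each named column=value.
    return f"year={moment.year}/month={moment.month}"


print(partition_subdirectory(1484810378))  # year=2017/month=1
```

Queries that filter on year and month can then skip directories that do not match.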
