How to filter date columns using Scala and store them as numbers in a dataframe
I have a dataframe (dateds1) as shown below:
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate| Contract Date| ReleaseDate|
+-----------+-----------+-------------------+-------------------+
| 1995/09/16| 2008/09/09|2009-02-09 00:00:00|2017-09-09 00:00:00|
| 1994/09/20| 2008/09/10|1999-05-05 00:00:00|2016-09-30 00:00:00|
| 1993/09/24| 2016/06/29|2003-12-07 00:00:00|2028-02-13 00:00:00|
| 1992/09/28| 2007/06/24|2004-06-05 00:00:00|2019-09-24 00:00:00|
| 1991/10/03| 2011/07/07|2011-07-07 00:00:00|2020-03-30 00:00:00|
| 1990/10/07| 2009/02/09|2009-02-09 00:00:00|2011-03-13 00:00:00|
| 1989/10/11| 1999/05/05|1999-05-05 00:00:00|2021-03-13 00:00:00|
+-----------+-----------+-------------------+-------------------+
I need help filtering this out; my output should look like the following:
+-----------+-----------+-------------------+-------------------+
|DateofBirth|JoiningDate|      Contract Date|        ReleaseDate|
+-----------+-----------+-------------------+-------------------+
|   19950916|   20080909|           20090209|           20170909|
|   19940920|   20080910|           19990505|           20160930|
|   19930924|   20160629|           20031207|           20280213|
|   19920928|   20070624|           20040605|           20190924|
|   19911003|   20110707|           20110707|           20200330|
|   19901007|   20090209|           20090209|           20110313|
|   19891011|   19990505|           19990505|           20210313|
+-----------+-----------+-------------------+-------------------+
I tried using filter, but I could only handle one of the cases, i.e. when the dates are all in a single format (YYYY/MM/DD or YYYY-MM-DD 00:00:00) and the number of columns is fixed. Can someone help me handle both formats when the number of columns is dynamic (columns may be added or removed)?
The dates should be converted from a date data type to Integer or Long in YYYYMMDD format.
Note: every record in this dataframe is in either YYYY/MM/DD or YYYY-MM-DD 00:00:00 format.
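For reference, the underlying string-to-number conversion works the same way for both formats, since in each case the first 10 characters hold the date. A minimal sketch in plain Scala (a hypothetical helper, no Spark required):

```scala
// Convert "YYYY/MM/DD" or "YYYY-MM-DD 00:00:00" to an Int in YYYYMMDD form.
// Assumes the input is well-formed; a production version would validate it.
def dateToInt(s: String): Int =
  s.take(10).filter(_.isDigit).toInt

// dateToInt("1995/09/16")          -> 19950916
// dateToInt("2009-02-09 00:00:00") -> 20090209
```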
Any help is appreciated.

To do the conversion dynamically, you have to iterate over all the columns and perform a different operation depending on each column's type. Here is an example:
import java.sql.Date
import java.sql.Timestamp
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._ // date_format, col, year, month, dayofmonth
import spark.implicits._                // for toDF; already in scope in spark-shell

val originalDf = Seq(
  (Timestamp.valueOf("2016-09-30 03:04:00"), Date.valueOf("2016-09-30")),
  (Timestamp.valueOf("2016-07-30 00:00:00"), Date.valueOf("2016-10-30"))
).toDF("ts_value", "date_value")
Original table details:
> originalDf.show
+-------------------+----------+
| ts_value|date_value|
+-------------------+----------+
|2016-09-30 03:04:00|2016-09-30|
|2016-07-30 00:00:00|2016-10-30|
+-------------------+----------+
> originalDf.printSchema
root
|-- ts_value: timestamp (nullable = true)
|-- date_value: date (nullable = true)
Example of the conversion:
// Fold over every column: rewrite date/timestamp columns as yyyyMMdd integers,
// leaving all other columns untouched.
val newDf = originalDf.columns.foldLeft(originalDf)((df, name) => {
  val data_type = df.schema(name).dataType
  if (data_type == DateType)
    df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
  else if (data_type == TimestampType)
    df.withColumn(name, year(col(name)) * 10000 + month(col(name)) * 100 + dayofmonth(col(name)))
  else
    df
})
New table details:
newDf.show
+--------+----------+
|ts_value|date_value|
+--------+----------+
|20160930| 20160930|
|20160730| 20161030|
+--------+----------+
newDf.printSchema
root
|-- ts_value: integer (nullable = true)
|-- date_value: integer (nullable = true)
If you don't want to do this for every column, you can change
val newDf = originalDf.columns.foldLeft ...
so that it folds over only the columns you want to convert.
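A sketch of that restriction, assuming (hypothetically) that only `ts_value` should be converted; the body is the same as above, only the collection being folded changes:

```scala
// Fold over an explicit list of column names instead of originalDf.columns.
val columnsToConvert = Seq("ts_value") // hypothetical subset of columns
val newDf = columnsToConvert.foldLeft(originalDf)((df, name) => {
  val data_type = df.schema(name).dataType
  if (data_type == DateType)
    df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
  else if (data_type == TimestampType)
    df.withColumn(name, year(col(name)) * 10000 + month(col(name)) * 100 + dayofmonth(col(name)))
  else
    df
})
```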
Hope this helps.

Thank you so much, Andrei, you're the best. One small doubt: if the date is 2019/03/03, it comes through as a string rather than a date type. What condition could be added in the else branch for that date format (YYYY/MM/DD)? Can you suggest something?

Don't worry Andrei, I figured it out. I added this: `else if (data_type == StringType) df.withColumn(name, concat(substring(col(name), 1, 4), substring(col(name), 6, 2), substring(col(name), 9, 2)))`
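Putting the three branches together as one helper (a sketch, not a definitive implementation; the StringType branch assumes every string column holds a "YYYY/MM/DD" date, and adds the cast to IntegerType that the comment's snippet leaves out):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Convert every date-like column of a DataFrame to an Int in yyyyMMdd form.
def datesToInts(input: DataFrame): DataFrame =
  input.columns.foldLeft(input) { (df, name) =>
    df.schema(name).dataType match {
      // date_format accepts both date and timestamp columns.
      case DateType | TimestampType =>
        df.withColumn(name, date_format(col(name), "yyyyMMdd").cast(IntegerType))
      // Assumes the string is always "YYYY/MM/DD": keep the digits, drop the slashes.
      case StringType =>
        df.withColumn(
          name,
          concat(substring(col(name), 1, 4),
                 substring(col(name), 6, 2),
                 substring(col(name), 9, 2)).cast(IntegerType))
      // Leave non-date columns untouched.
      case _ => df
    }
  }
```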