Crosstab by date and hour in Scala Spark
Sample DF:
val someDF = Seq(
  (1, "2017-12-02 03:04:00"),
  (1, "2017-12-02 03:45:00"),
  (1, "2017-12-02 04:04:00"),
  (2, "2017-12-02 04:14:00"),
  (2, "2017-12-02 04:54:00"),
  (3, "2017-10-01 11:45:20"),
  (4, "2017-10-01 02:45:20")
).toDF("number", "date")
When I try using crosstab:
var temp = someDF.stat.crosstab("date","number")
temp.show()
I want to apply the same crosstab, but using only the date and the hour, for example 2017-12-02 03.
Expected output:
+----------------+---+---+---+---+
|date_Hour_number|  1|  2|  3|  4|
+----------------+---+---+---+---+
|   2017-10-01 11|  0|  0|  1|  0|
|   2017-12-02 03|  1|  0|  0|  0|
|   2017-12-02 04|  0|  2|  0|  0|
+----------------+---+---+---+---+
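Any suggestions would be helpful.

Since your date column is of String type, you can simply use substring to trim date down to the hour before applying crosstab: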
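import org.apache.spark.sql.functions.substring
// Spark's substring is 1-based; the first 13 characters are "yyyy-MM-dd HH".
someDF.
  withColumn("datehour", substring($"date", 1, 13)).
  stat.crosstab("datehour", "number").
  show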
// +---------------+---+---+---+---+
// |datehour_number| 1| 2| 3| 4|
// +---------------+---+---+---+---+
// | 2017-10-01 02| 0| 0| 0| 1|
// | 2017-10-01 11| 0| 0| 1| 0|
// | 2017-12-02 04| 1| 2| 0| 0|
// | 2017-12-02 03| 2| 0| 0| 0|
// +---------------+---+---+---+---+
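
If the date column were a real timestamp rather than a string (or if you prefer not to depend on character positions), one possible variant is to parse it with to_timestamp and format it back with date_format. This is only a sketch: the object name DateHourCrosstab, the helper addDateHour, and the local SparkSession are assumptions made for illustration and are not part of the original question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, date_format, to_timestamp}

object DateHourCrosstab {
  // Hypothetical helper: adds a "datehour" column formatted as "yyyy-MM-dd HH".
  // Assumes the input has a "date" column that is castable to a timestamp.
  def addDateHour(df: DataFrame): DataFrame =
    df.withColumn("datehour", date_format(to_timestamp(col("date")), "yyyy-MM-dd HH"))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("datehour-crosstab")
      .getOrCreate()
    import spark.implicits._

    val someDF = Seq(
      (1, "2017-12-02 03:04:00"),
      (1, "2017-12-02 03:45:00"),
      (1, "2017-12-02 04:04:00"),
      (2, "2017-12-02 04:14:00"),
      (2, "2017-12-02 04:54:00"),
      (3, "2017-10-01 11:45:20"),
      (4, "2017-10-01 02:45:20")
    ).toDF("number", "date")

    // crosstab names its first column "<col1>_<col2>", here "datehour_number".
    // Sorting on it makes the row order stable between runs.
    addDateHour(someDF)
      .stat.crosstab("datehour", "number")
      .orderBy("datehour_number")
      .show()

    spark.stop()
  }
}
```

The counts should match the substring version above; the orderBy is only there because crosstab does not guarantee the order of its rows.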