Hourly aggregation in Scala Spark
I am looking for a way to aggregate my data by hour. As a first step, I want to keep only the hour part of my evtTime column. My DataFrame looks like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:23:06.426|1 |
|X166815|2018-01-01 02:20:06.426|2 |
|X166816|2018-01-01 11:25:06.429|5 |
|X166817|2018-02-01 10:23:06.429|1 |
|X166818|2018-01-01 09:23:06.430|3 |
|X166819|2018-01-01 10:15:06.430|8 |
|X166820|2018-08-01 11:00:06.431|20 |
|X166821|2018-03-01 06:23:06.431|7 |
|X166822|2018-01-01 07:23:06.434|2 |
|X166823|2018-01-01 11:23:06.434|1 |
+-------+-----------------------+-----------+
My goal is to get something like this:
+-------+-----------------------+-----------+
|reqUser|evtTime |event_count|
+-------+-----------------------+-----------+
|X166814|2018-01-01 11:00:00.000|1 |
|X166815|2018-01-01 02:00:00.000|2 |
|X166816|2018-01-01 11:00:00.000|5 |
|X166817|2018-02-01 10:00:00.000|1 |
|X166818|2018-01-01 09:00:00.000|3 |
|X166819|2018-01-01 10:00:00.000|8 |
|X166820|2018-08-01 11:00:00.000|20 |
|X166821|2018-03-01 06:00:00.000|7 |
|X166822|2018-01-01 07:00:00.000|2 |
|X166823|2018-01-01 11:00:00.000|1 |
+-------+-----------------------+-----------+
I am using Scala 2.10.5 and Spark 1.6.3. My next goal is to group by reqUser and compute the sum of event_count. I tried this:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{round, sum}
val new_df = df
.groupBy($"reqUser",
Window(col("evtTime"), "1 hour"))
.agg(sum("event_count") as "aggregate_sum")
This is the error message I get:
Error:(81, 15) org.apache.spark.sql.expressions.Window.type does not take parameters
Window(col("time"), "1 hour"))
Any help? Thanks.

In Spark 1.x you can use the date formatting utilities, for example date_format:
import org.apache.spark.sql.functions.date_format
val df = Seq("2018-01-01 10:15:06.430").toDF("evtTime")
df.select(date_format($"evtTime".cast("timestamp"), "yyyy-MM-dd HH:00:00")).show
+------------------------------------------------------------+
|date_format(CAST(evtTime AS TIMESTAMP), yyyy-MM-dd HH:00:00)|
+------------------------------------------------------------+
| 2018-01-01 10:00:00|
+------------------------------------------------------------+
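To connect the answer back to the original goal, here is a minimal sketch that truncates evtTime to the hour with date_format and then sums event_count per user and hour. It assumes Spark 1.6 running in local mode; the object name HourlyAggregation, the helper sumByHour, the column names evtHour and aggregate_sum, and the third sample row are hypothetical and only there for illustration. The Window(...) call in the question fails because org.apache.spark.sql.expressions.Window is the specification object for window functions, not a time-bucketing function; the window function in org.apache.spark.sql.functions was only added in Spark 2.0, so it is not available on 1.6.3.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{col, date_format, sum}

object HourlyAggregation {

  // Truncates evtTime to the hour (as a formatted string),
  // then sums event_count per (reqUser, hour) pair.
  def sumByHour(df: DataFrame): DataFrame =
    df
      .withColumn("evtHour",
        date_format(col("evtTime").cast("timestamp"), "yyyy-MM-dd HH:00:00"))
      .groupBy(col("reqUser"), col("evtHour"))
      .agg(sum("event_count").as("aggregate_sum"))

  def main(args: Array[String]): Unit = {
    // Local-mode context; on a cluster the master would come from spark-submit.
    val conf = new SparkConf().setAppName("hourly-aggregation").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val df = Seq(
      ("X166814", "2018-01-01 11:23:06.426", 1),
      ("X166815", "2018-01-01 02:20:06.426", 2),
      // Hypothetical extra row (not in the question) so that one group has two events.
      ("X166814", "2018-01-01 11:40:00.000", 4)
    ).toDF("reqUser", "evtTime", "event_count")

    sumByHour(df).show(false)
    sc.stop()
  }
}
```

Note that evtHour is a string column here. If a timestamp type is needed downstream, casting the formatted value back with .cast("timestamp") should work, since the pattern produces a standard timestamp literal.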