Apache Spark: aggregate whole weeks on the week start date (Monday)


Using the window function, we cannot get Spark to start the weekly aggregation on a Monday. Is there any way, or any workaround, to do this?

from pyspark.sql.functions import window, sum

df = spark.createDataFrame([
  ("001", "event1", 10, "2016-05-01 10:50:51"),
  ("002", "event2", 100, "2016-05-02 10:50:53"),
  ("001", "event3", 20, "2016-05-03 10:50:55"),
  ("010", "event3", 20, "2016-05-05 10:50:55"),
  ("001", "event1", 15, "2016-05-01 10:51:50"),
  ("003", "event1", 13, "2016-05-10 10:55:30"),
  ("001", "event2", 12, "2016-05-11 10:57:00"),
  ("001", "event3", 11, "2016-05-21 11:00:01"),
  ("002", "event2", 100, "2016-05-22 10:50:53"),
  ("001", "event3", 20, "2016-05-28 10:50:55"),
  ("001", "event1", 15, "2016-05-30 10:51:50"),
  ("003", "event1", 13, "2016-06-10 10:55:30"),
  ("001", "event2", 12, "2016-06-12 10:57:00"),
  ("001", "event3", 11, "2016-06-14 11:00:01")]).toDF("KEY", "Event_Type", "metric", "Time")

# 7-day tumbling windows with the default start offset
df2 = (df.groupBy(window("Time", "7 day"))
         .agg(sum("KEY").alias("aggregate_sum"))
         .select("window.start", "window.end", "aggregate_sum")
         .orderBy("window"))

The expected output is the data aggregated over weeks that start on Monday. However, Spark starts its 7-day aggregation windows from an arbitrary day of the week.

Windows start by default at 1970-01-01, which is a Thursday. You can use

window("Time", "7 day", startTime="4 days")

to shift them to Monday.
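A minimal end-to-end sketch of this, reusing the df and column names from the question (the aggregation is kept as sum("KEY") only to mirror the original code):

from pyspark.sql.functions import window, sum

# 7-day tumbling windows shifted by 4 days: the default origin 1970-01-01 is a
# Thursday, and Thursday + 4 days = Monday, so every window now opens on a Monday.
df2 = (df.groupBy(window("Time", "7 day", startTime="4 days"))
         .agg(sum("KEY").alias("aggregate_sum"))
         .select("window.start", "window.end", "aggregate_sum")
         .orderBy("window"))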

Thanks a lot! Possible duplicate. Used the same approach and it works: groupBy(window("Time", "1 week", "1 week", "96 hours"))
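For reference, since 96 hours equals 4 days, the variant from the comment should be equivalent (a sketch under the same assumptions as above):

# explicit slide duration of "1 week" makes the windows tumbling,
# and the "96 hours" start offset again lands the window start on Monday
df3 = (df.groupBy(window("Time", "1 week", "1 week", "96 hours"))
         .agg(sum("KEY").alias("aggregate_sum"))
         .select("window.start", "window.end", "aggregate_sum")
         .orderBy("window"))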