Apache Spark: how to create a specific time frame in Spark


I have tracker data in which we store the tracker number and the arrival timestamp.

+---------+-------------------+
|trackerno|              adate|
+---------+-------------------+
| 54046022|2019-03-01 18:00:00|
| 54030173|2019-03-01 17:45:00|
| 53451324|2019-03-01 17:50:00|
| 54002797|2019-03-01 18:30:00|
| 53471705|2019-03-01 17:59:00|
I want the last 15 minutes of data, between 17:44:59 and 17:59:59. I am using a Spark application.

Expected output:

+---------+-------------------+
|trackerno|              adate|
+---------+-------------------+
| 54030173|2019-03-01 17:45:00|
| 53451324|2019-03-01 17:50:00|
| 53471705|2019-03-01 17:59:00|

You can try the following:

  import org.apache.spark.sql.functions._
  import spark.implicits._

  val df = Seq(
    (54046022, "2019-03-01 18:00:00"),
    (54030173, "2019-03-01 17:45:00"),
    (53451324, "2019-03-01 17:50:00"),
    (54002797, "2019-03-01 18:30:00"),
    (53471705, "2019-03-01 17:59:00")
  ).toDF("trackerno", "adate")

  // Parse the string column into a proper timestamp
  val tsDF = df.withColumn("ts", to_timestamp($"adate"))

  // Keep only the rows whose timestamp falls inside the fixed 15-minute window
  val result = tsDF.
    select($"trackerno", $"adate").
    where($"ts" >= to_timestamp(lit("2019-03-01 17:44:59")) &&
      $"ts" <= to_timestamp(lit("2019-03-01 17:59:59")))

  result.show(false)
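
If the window bounds should not be hard-coded, one possible variation is to derive the lower bound from an upper bound supplied at runtime. This is only a sketch; `windowEnd` is an assumed parameter, not something from the question (it could just as well be `current_timestamp()`):

  import org.apache.spark.sql.functions._

  // Sketch: `windowEnd` is an assumed runtime parameter, not part of the original answer.
  val windowEnd = to_timestamp(lit("2019-03-01 17:59:59"))

  val last15 = tsDF
    .where($"ts" >= (windowEnd - expr("INTERVAL 15 MINUTES")) && $"ts" <= windowEnd)
    .select($"trackerno", $"adate")

  last15.show(false)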

Your question is not very clear, in particular how the start and end of the 15-minute window are supposed to be determined. I am answering based on my own understanding.

Create a 15-minute window:

from pyspark.sql.functions import window

grouped_window = df.groupBy(window("adate", "15 minutes"), "trackerno", "adate").count()
This gives you a result like this:

+------------------------------------------+---------+-------------------+-----+
|window                                    |trackerno|adate              |count|
+------------------------------------------+---------+-------------------+-----+
|[2019-03-01 17:45:00, 2019-03-01 18:00:00]|53451324 |2019-03-01 17:50:00|1    |
|[2019-03-01 18:30:00, 2019-03-01 18:45:00]|54002797 |2019-03-01 18:30:00|1    |
|[2019-03-01 17:45:00, 2019-03-01 18:00:00]|53471705 |2019-03-01 17:59:00|1    |
|[2019-03-01 18:00:00, 2019-03-01 18:15:00]|54046022 |2019-03-01 18:00:00|1    |
|[2019-03-01 17:45:00, 2019-03-01 18:00:00]|54030173 |2019-03-01 17:45:00|1    |
+------------------------------------------+---------+-------------------+-----+

from pyspark.sql import functions as f
from pyspark.sql import Window

# Count the rows per 15-minute window and keep only the rows whose
# window contains more than one record
w = Window.partitionBy('window')

grouped_window.select('adate', 'trackerno', f.count('count').over(w).alias('dupeCount')).sort('adate')\
    .where('dupeCount > 1')\
    .drop('dupeCount')\
    .show()

+-------------------+---------+
|              adate|trackerno|
+-------------------+---------+
|2019-03-01 17:45:00| 54030173|
|2019-03-01 17:50:00| 53451324|
|2019-03-01 17:59:00| 53471705|
+-------------------+---------+
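
For comparison with the first answer, a rough Scala equivalent that keeps only one fixed 15-minute bucket could look like this. It is a sketch only: it reuses the `tsDF` built above, and the choice of the bucket starting at 17:45 is an assumption:

  import org.apache.spark.sql.functions._

  // Sketch: assign each row to a fixed 15-minute bucket and keep only the
  // bucket that starts at 2019-03-01 17:45:00 (assumed to be the one of interest).
  tsDF
    .withColumn("w", window($"ts", "15 minutes"))
    .where($"w.start" === to_timestamp(lit("2019-03-01 17:45:00")))
    .select($"trackerno", $"adate")
    .show(false)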

I want this for any time within the 0-24 hour day. Each 15-minute window is fixed, i.e. it always has a specific start and end time, since the 15-minute window can fall anywhere within the day.

Please add more details about what you have tried so far; this sounds like a homework question with no attempt.

What I tried so far:

  v_df.distinct
    .withColumn("timestamp", to_timestamp(unix_timestamp(col("adate"))))
    .withColumn("Date", date_format(col("timestamp"), "yyyy-MM-dd"))
    .withColumn("time", date_format(col("timestamp"), "HH:mm:ss"))
    .withColumn("mydata", when(minute($"time").between(44, 59), 1.0))
    .show
df.where(minute($"ts")>=45)
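
Spelled out, that minute-based filter could look roughly like this. It is a sketch that reuses `tsDF` from the first answer; note that `minute >= 45` alone would match the last quarter of every hour, so it also needs a date and hour condition:

  import org.apache.spark.sql.functions._

  // Sketch: restrict to 2019-03-01 17:45-17:59 by combining date, hour and minute.
  tsDF
    .where(to_date($"ts") === lit("2019-03-01") &&
      hour($"ts") === 17 &&
      minute($"ts") >= 45)
    .select($"trackerno", $"adate")
    .show(false)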