
Scala: How do I filter a Spark DataFrame based on the occurrences of a value in one column, with a condition on a date column?


Team, I'm working with a DataFrame that looks like this:

    df
    client   | date   
      C1     |08-NOV-18 11.29.43
      C2     |09-NOV-18 13.29.43
      C2     |09-NOV-18 18.29.43
      C3     |11-NOV-18 19.29.43
      C1     |12-NOV-18 10.29.43
      C2     |13-NOV-18 09.29.43
      C4     |14-NOV-18 20.29.43
      C1     |15-NOV-18 11.29.43
      C5     |16-NOV-18 15.29.43
      C10    |17-NOV-18 19.29.43
      C1     |18-NOV-18 12.29.43
      C2     |18-NOV-18 10.29.43
      C2     |19-NOV-18 09.29.43
      C6     |20-NOV-18 13.29.43
      C6     |21-NOV-18 14.29.43
      C1     |21-NOV-18 18.29.43
      C1     |22-NOV-18 11.29.43
My goal is to filter this DataFrame and obtain a new DataFrame that contains the last two occurrences for each client, but only when those occurrences are less than 24 hours apart. For this example the result should be:

     client  |date
      C2     |18-NOV-18 10.29.43
      C2     |19-NOV-18 09.29.43
      C1     |21-NOV-18 18.29.43
      C1     |22-NOV-18 11.29.43

Please help.

I have a solution for this case:

  import java.sql.Timestamp
  import org.apache.spark.sql.functions.{col, collect_list, explode, udf}

  val milliSecForADay = 24 * 60 * 60 * 1000

  // For each client, keep the two most recent timestamps, but only if they are
  // less than 24 hours apart; otherwise return an empty array.
  val filterDatesUDF = udf { arr: scala.collection.mutable.WrappedArray[Timestamp] =>
    arr.sortWith(_ after _).toList match {
      case last :: secondLast :: _ if (last.getTime - secondLast.getTime) < milliSecForADay =>
        Array(secondLast, last)
      case _ => Array.empty[Timestamp]
    }
  }

  val finalDF = df.groupBy("client")
    .agg(collect_list("date").as("dates"))
    .select(col("client"), explode(filterDatesUDF(col("dates"))).as("date"))

  finalDF.show()
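
To see the rule the UDF applies in isolation, here is a minimal plain-Scala sketch of the same logic outside Spark (the helper name lastTwoWithinADay is mine, for illustration only): keep the two most recent timestamps only when they are less than 24 hours apart.

import java.sql.Timestamp

val milliSecForADay = 24L * 60 * 60 * 1000

def lastTwoWithinADay(ts: Seq[Timestamp]): Seq[Timestamp] =
  ts.sortWith(_ after _).toList match {
    case last :: secondLast :: _ if (last.getTime - secondLast.getTime) < milliSecForADay =>
      Seq(secondLast, last)
    case _ => Seq.empty
  }

val a = Timestamp.valueOf("2018-11-21 18:29:43")
val b = Timestamp.valueOf("2018-11-22 11:29:43")
println(lastTwoWithinADay(Seq(a, b))) // List(2018-11-21 18:29:43.0, 2018-11-22 11:29:43.0)
println(lastTwoWithinADay(Seq(a)))    // List()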

Use window functions. Check this out:

val df = Seq(("C1","08-NOV-18 11.29.43"),
  ("C2","09-NOV-18 13.29.43"),
  ("C2","09-NOV-18 18.29.43"),
  ("C3","11-NOV-18 19.29.43"),
  ("C1","12-NOV-18 10.29.43"),
  ("C2","13-NOV-18 09.29.43"),
  ("C4","14-NOV-18 20.29.43"),
  ("C1","15-NOV-18 11.29.43"),
  ("C5","16-NOV-18 15.29.43"),
  ("C10","17-NOV-18 19.29.43"),
  ("C1","18-NOV-18 12.29.43"),
  ("C2","18-NOV-18 10.29.43"),
  ("C2","19-NOV-18 09.29.43"),
  ("C6","20-NOV-18 13.29.43"),
  ("C6","21-NOV-18 14.29.43"),
  ("C1","21-NOV-18 18.29.43"),
  ("C1","22-NOV-18 11.29.43")).toDF("client","dt").withColumn("dt",from_unixtime(unix_timestamp('dt,"dd-MMM-yy HH.mm.ss"),"yyyy-MM-dd HH:mm:ss"))

df.createOrReplaceTempView("tbl")

val df2 = spark.sql("""
  select * from (
    select client, dt,
           count(*) over (partition by client) cnt,
           rank() over (partition by client order by dt desc) rk1
    from tbl
  ) t
  where cnt > 1 and rk1 in (1, 2)
""")

df2.alias("t1").join(df2.alias("t2"), $"t1.client" === $"t2.client" and $"t1.rk1" =!= $"t2.rk1" , "inner" ).withColumn("dt24",(unix_timestamp($"t1.dt") - unix_timestamp($"t2.dt") )/ 3600 ).where("dt24 > -24 and dt24 < 24").select($"t1.client", $"t1.dt").show(false)

Using window functions you can find the next/previous date for each row and then filter out the rows whose gap to the neighbouring date is greater than 24 hours.

Data preparation

val df = Seq(("C1", "08-NOV-18 11.29.43"),
  ("C2", "09-NOV-18 13.29.43"),
  ("C2", "09-NOV-18 18.29.43"),
  ("C3", "11-NOV-18 19.29.43"),
  ("C1", "12-NOV-18 10.29.43"),
  ("C2", "13-NOV-18 09.29.43"),
  ("C4", "14-NOV-18 20.29.43"),
  ("C1", "15-NOV-18 11.29.43"),
  ("C5", "16-NOV-18 15.29.43"),
  ("C10", "17-NOV-18 19.29.43"),
  ("C1", "18-NOV-18 12.29.43"),
  ("C2", "18-NOV-18 10.29.43"),
  ("C2", "19-NOV-18 09.29.43"),
  ("C6", "20-NOV-18 13.29.43"),
  ("C6", "21-NOV-18 14.29.43"),
  ("C1", "21-NOV-18 18.29.43"),
  ("C1", "22-NOV-18 11.29.43"))
  .toDF("client", "dt")
  .withColumn("dt", to_timestamp($"dt", "dd-MMM-yy HH.mm.ss"))
Code

// get next/prev dates
val dateWindow = Window.partitionBy("client").orderBy("dt")
val withNextPrevDates = df
  .withColumn("previousDate", lag($"dt", 1).over(dateWindow))
  .withColumn("nextDate", lead($"dt", 1).over(dateWindow))

// function for filter: the gap must be under 24 hours (in seconds) and the two
// timestamps must fall on consecutive calendar days (datediff === 1)
val secondsInDay = TimeUnit.DAYS.toSeconds(1)
val dateDiffLessThanDay = (startTimeStamp: Column, endTimeStamp: Column) =>
  endTimeStamp.cast(LongType) - startTimeStamp.cast(LongType) < secondsInDay &&
    datediff(endTimeStamp, startTimeStamp) === 1

// filter
val result = withNextPrevDates
  .where(dateDiffLessThanDay($"previousDate", $"dt") || dateDiffLessThanDay($"dt", $"nextDate"))
  .drop("previousDate", "nextDate")

Comments on the question: "Your question is not clear. What exactly do you want as output? What is the type of the date column?" — hello @BalajiReddy, I edited the question; I want to get a DataFrame containing, for each client, the last two observations whose dates differ by less than 24 hours. @anujsaxena the date column is a timestamp.
val df = Seq(("C1", "08-NOV-18 11.29.43"),
  ("C2", "09-NOV-18 13.29.43"),
  ("C2", "09-NOV-18 18.29.43"),
  ("C3", "11-NOV-18 19.29.43"),
  ("C1", "12-NOV-18 10.29.43"),
  ("C2", "13-NOV-18 09.29.43"),
  ("C4", "14-NOV-18 20.29.43"),
  ("C1", "15-NOV-18 11.29.43"),
  ("C5", "16-NOV-18 15.29.43"),
  ("C10", "17-NOV-18 19.29.43"),
  ("C1", "18-NOV-18 12.29.43"),
  ("C2", "18-NOV-18 10.29.43"),
  ("C2", "19-NOV-18 09.29.43"),
  ("C6", "20-NOV-18 13.29.43"),
  ("C6", "21-NOV-18 14.29.43"),
  ("C1", "21-NOV-18 18.29.43"),
  ("C1", "22-NOV-18 11.29.43"))
  .toDF("client", "dt")
  .withColumn("dt", to_timestamp($"dt", "dd-MMM-yy HH.mm.ss"))
// get next/prev dates
val dateWindow = Window.partitionBy("client").orderBy("dt")
val withNextPrevDates = df
  .withColumn("previousDate", lag($"dt", 1).over(dateWindow))
  .withColumn("nextDate", lead($"dt", 1).over(dateWindow))

// function for filter
val secondsInDay = TimeUnit.DAYS.toSeconds(1)
val dateDiffLessThanDay = (startTimeStamp: Column, endTimeStamp: Column) =>
  endTimeStamp.cast(LongType) - startTimeStamp.cast(LongType) < secondsInDay && datediff(endTimeStamp, startTimeStamp) === 1

// filter
val result = withNextPrevDates
  .where(dateDiffLessThanDay($"previousDate", $"dt") || dateDiffLessThanDay($"dt", $"nextDate"))
  .drop("previousDate", "nextDate")
+------+-------------------+
|client|dt                 |
+------+-------------------+
|C1    |2018-11-21 18:29:43|
|C1    |2018-11-22 11:29:43|
|C2    |2018-11-18 10:29:43|
|C2    |2018-11-19 09:29:43|
+------+-------------------+