Apache Spark: filter data with groupBy count


The DataFrame A_df looks like this:

+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     1|2017|   03|
|     1|2017|   05|
|     2|2017|   01|
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

I need to keep only the rows whose uid appears more than 2 times. Expected result:

+------+----+-----+
|   uid|year|month|
+------+----+-----+
|     3|2017|   02|
|     3|2017|   04|
|     3|2017|   05|
+------+----+-----+

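How can I get this result in Scala? My solution:

import org.apache.spark.sql.functions.count

// uids that occur more than twice
val condition_uid = A_df.groupBy("uid")
  .agg(count("*").alias("cnt"))
  .filter("cnt > 2").select("uid")
// the inner join keeps only the rows with those uids
val results_df = A_df.join(condition_uid, Seq("uid"))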

Is there a better way?

I think a window function fits well here, because you don't have to join the DataFrame back to itself:

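import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

// No orderBy: an ordered window would produce a running count, not the per-uid total.
val window = Window.partitionBy("uid")

df.withColumn("count", count("uid").over(window))
  .filter($"count" > 2).drop("count").show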

Output for the sample data in the question (row order may vary):

+---+----+-----+
|uid|year|month|
+---+----+-----+
|  3|2017|   02|
|  3|2017|   04|
|  3|2017|   05|
+---+----+-----+
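
The window-based filter can be tried end to end with a small standalone program. This is only a minimal sketch, assuming a local SparkSession and the sample data from the question; the object name FilterByGroupCount, the helper keepFrequentUids, and the temporary column name cnt are illustrative and do not come from the original posts. It uses a window partitioned by uid with no ordering, so each row sees the total number of rows for its uid.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, lit}

object FilterByGroupCount {
  // Hypothetical helper: keep only rows whose uid occurs more than `minCount` times.
  // Assumes the input has a "uid" column and no existing column named "cnt".
  def keepFrequentUids(df: DataFrame, minCount: Long): DataFrame = {
    // Unordered window: the frame is the whole partition, so every row
    // of a given uid gets that uid's total row count.
    val byUid = Window.partitionBy("uid")
    df.withColumn("cnt", count(lit(1)).over(byUid))
      .filter(col("cnt") > minCount)
      .drop("cnt")
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only (assumed setup).
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-by-group-count")
      .getOrCreate()
    import spark.implicits._

    // Sample data copied from the question; month is kept as a string.
    val aDf = Seq(
      (1, 2017, "03"), (1, 2017, "05"),
      (2, 2017, "01"),
      (3, 2017, "02"), (3, 2017, "04"), (3, 2017, "05")
    ).toDF("uid", "year", "month")

    // Sorting is only for a stable display order.
    keepFrequentUids(aDf, 2).orderBy("uid", "month").show()

    spark.stop()
  }
}
```

With the sample data, this should print only the three rows for uid 3. Compared with the groupBy-and-join approach, the window version avoids the second pass that joins the counts back, though it still shuffles the data by uid, so which one is faster may depend on the data and the Spark version.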
