Apache Spark Structured Streaming: how to deduplicate by latest event and aggregate counts

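I want to perform a Structured Streaming aggregation over a time window. Given the data schema below, the goal is to keep only each user's latest event, and then aggregate the count of each event type per location.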

time    location   user   type
 1        A         1      one
 2        A         1      two
 1        B         2      one
 2        B         2      one
 1        A         3      two
 1        A         4      one

Sample output:

location   countOne   countTwo
    A          1         2
    B          1         0
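
To make the intended result concrete, here is a minimal sketch of the same two-step logic on plain Scala collections, with no Spark involved. The `Event` case class, the field name `eventType`, and the object name are hypothetical and only mirror the columns in the question.

```scala
// Hypothetical record type mirroring the question's columns (not from the original post).
final case class Event(time: Long, location: String, user: Int, eventType: String)

object LatestEventCounts {
  // The sample rows from the question.
  val sample: Seq[Event] = Seq(
    Event(1, "A", 1, "one"),
    Event(2, "A", 1, "two"),
    Event(1, "B", 2, "one"),
    Event(2, "B", 2, "one"),
    Event(1, "A", 3, "two"),
    Event(1, "A", 4, "one")
  )

  // Step 1: keep only the most recent event of each user.
  // Step 2: count "one" and "two" events per location.
  def countsByLocation(events: Seq[Event]): Map[String, (Int, Int)] = {
    val latestPerUser = events.groupBy(_.user).values.map(_.maxBy(_.time))
    latestPerUser.groupBy(_.location).map { case (location, evs) =>
      (location, (evs.count(_.eventType == "one"), evs.count(_.eventType == "two")))
    }
  }

  def main(args: Array[String]): Unit = {
    // Prints "A 1 2" and then "B 1 0", matching the sample output.
    countsByLocation(sample).toSeq.sortBy(_._1).foreach { case (loc, (one, two)) =>
      println(s"$loc $one $two")
    }
  }
}
```

User 1's earlier event at time 1 is dropped in step 1, which is why location A ends up with one "one" and two "two" events.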

Something like the following:

val aggTypes = df
  .select($"location", $"time", $"user", $"type")
  .groupBy($"user")
  .agg(max($"timestamp") as "timestamp")
  .select("*")
  .withWatermark("timestamp", conf.kafka.watermark.toString + " seconds")
  .groupBy(functions.window($"timestamp", DataConstant.t15min.toString + " seconds", DataConstant.t1min.toString + " seconds"), $"location")
  .agg(count(when($"type" === "one", $"type")) as "countOne", count(when($"type" === "two", $"type")) as "countTwo")
  .drop($"window")

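Structured Streaming does not support multiple aggregations, and non-time-based windows are not supported on streaming DataFrames/Datasets, so I am not sure whether the desired output can be achieved in a single streaming query.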

Any help is much appreciated.

It looks like you are trying to do a stateless aggregation.

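flatMapGroups is an aggregation API that applies a function to each group in a Dataset. It is only available on grouped Datasets. flatMapGroups does not support partial aggregation, which increases shuffle overhead, so use this API only for small aggregations where each group fits in memory. Where possible, prefer a reduce function or an Aggregator.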

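import org.apache.spark.sql.streaming.OutputMode

// Count the elements in each group; flatMapGroups receives the whole group as an iterator.
val count = words.groupByKey(x => x)
  .flatMapGroups { case (x, iterator) => Iterator((x, iterator.length)) }
  .toDF("x", "count")

// Print each micro-batch's result to the console; the query only runs once start() is called.
count.writeStream.format("console").outputMode(OutputMode.Append()).start()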
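
As a rough sketch of the reduce-based alternative applied to the question's data: the code below keeps each user's latest event with `reduceGroups` and then counts event types per location. The `Event` case class, the `eventType` field, and the object and method names are hypothetical. Running the logic through `foreachBatch` is a swapped-in technique rather than something the answer above proposes, and it assumes that a per-micro-batch result is acceptable; it does not deduplicate across micro-batches or over the 15-minute window.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, count, when}
import org.apache.spark.sql.streaming.StreamingQuery

// Hypothetical record type mirroring the question's columns (not from the original post).
case class Event(time: Long, location: String, user: Int, eventType: String)

object LatestPerUserCounts {
  // Batch-style logic: latest event per user, then one count column per event type.
  def countsByLocation(events: Dataset[Event]): DataFrame = {
    val spark = events.sparkSession
    import spark.implicits._
    events
      .groupByKey(_.user)
      // reduceGroups can combine partial results before the shuffle, unlike flatMapGroups.
      .reduceGroups((a: Event, b: Event) => if (a.time >= b.time) a else b)
      .map(_._2)
      .groupBy(col("location"))
      .agg(
        count(when(col("eventType") === "one", 1)).as("countOne"),
        count(when(col("eventType") === "two", 1)).as("countTwo"))
  }

  // Assumed wiring: treat every micro-batch as a static Dataset and show its result.
  def start(stream: Dataset[Event]): StreamingQuery = {
    // A typed function value sidesteps the Scala/Java overload ambiguity of foreachBatch.
    val perBatch: (Dataset[Event], Long) => Unit =
      (batch, _) => countsByLocation(batch).show()
    stream.writeStream.foreachBatch(perBatch).start()
  }
}
```

Because each micro-batch is handled as an ordinary batch Dataset, chaining the two aggregations is allowed inside `countsByLocation`, whereas the same chain applied directly to the streaming Dataset would run into the multiple-aggregation restriction mentioned in the question.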