Apache Spark: How to do stateless aggregations in Spark using Structured Streaming 2.3.0 without using flatMapGroupsWithState?


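How can I do stateless aggregations in Spark using Structured Streaming 2.3.0 without using flatMapGroupsWithState or the DStream API? I am looking for a more declarative way.

For example:

select count(*) from some_view

I want the output to count only the records available in each batch, not to aggregate over previous batches.

To do stateless aggregations in Spark using Structured Streaming 2.3.0 without using flatMapGroupsWithState or the DStream API, you can use the following code: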

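import org.apache.spark.sql.functions.{col, lit}
import spark.implicits._

// Ignores the key and counts the rows in the group
def countValues = (_: String, it: Iterator[(String, String)]) => it.length

val query =
  dataStream
    // Constant key, so every row of a batch lands in the same group
    .select(lit("a").as("newKey"), col("value"))
    .as[(String, String)]
    .groupByKey { case (newKey, _) => newKey }
    .mapGroups[Int](countValues)
    .writeStream
    .format("console")
    .start()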

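Here is what this does:

We added a column, newKey, to dataStream. We did this so that we can group on it using groupByKey. I used the literal string "a", but you can use anything. You also need to select one of the columns available in dataStream. I selected the value column, but any column will do.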

We created a mapping function, countValues, which counts the values grouped by groupByKey by calling it.length.

This way, we count only the records available in each batch, without aggregating over previous batches.
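To make the grouping step concrete, here is a minimal sketch in plain Scala collections, with no Spark involved. It only mimics the shape of the groupByKey and mapGroups calls on one batch at a time. The names PerBatchCount and countBatch and the sample batches are hypothetical, introduced purely for illustration.

```scala
object PerBatchCount {
  // Same shape as the mapping function in the answer:
  // ignore the key and count the rows in the group.
  val countValues: (String, Iterator[(String, String)]) => Int =
    (_, it) => it.length

  // Hypothetical stand-in for one micro-batch: tag every value with the
  // constant key "a", group on that key, then apply countValues per group.
  def countBatch(values: Seq[String]): Seq[Int] =
    values
      .map(v => ("a", v))
      .groupBy { case (newKey, _) => newKey }
      .map { case (key, rows) => countValues(key, rows.iterator) }
      .toSeq

  def main(args: Array[String]): Unit = {
    // Two made-up batches; each one is counted on its own.
    val batches = Seq(Seq("x", "y", "z"), Seq("p", "q"))
    batches.foreach(batch => println(countBatch(batch)))
  }
}
```

Because the key is constant, each non-empty batch collapses into a single group, so the result holds one count per batch. In this sketch an empty batch yields no groups and therefore no count at all, rather than a 0.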

I hope this helps.

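I am looking for the declarative way described in the question, so I am trying to solve this with a raw SQL string. That means no mapping functions, unless they can be used as part of the raw SQL.

@user1870400 I am not aware of any declarative way.

If you select the literal "a", won't the whole stream become one group?

@user1870400 Yes, it will.

Is there a way to create a static DataFrame inside mapGroups? Say mapGroups gives me an iterator of rows. I just want to populate a static DataFrame from that iterator. Is that possible?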