Apache Spark: Error when using mapGroupsWithState in Spark Structured Streaming


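I get an error when I apply a window operation to the result of mapGroupsWithState in order to get aggregated counts over several fields.

The input follows the schema below. There can be many events with the same id but different timestamp and state values.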

root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- state: int (nullable = true)

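For example:

Using mapGroupsWithState, I keep only the event with the most recent timestamp for each id. The resulting schema is the same, but there are no duplicate ids and each row holds the latest event for its id.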

root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- state: int (nullable = true)

The result for the events above is:

event("abc", "a", 2, 2)
event("def", "b", 2, 1)
event("ghi", "b", 1, 1)
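
To make the "keep only the latest event per id" step concrete, here is a minimal plain-Scala sketch of the state update that such a mapGroupsWithState function would perform. It does not use Spark. The `Event` case class, the `LatestPerId` object and the input batches are hypothetical, and the inputs were chosen only so that they reproduce the result listed above. The timestamp is modelled as a `Long` for brevity.

```scala
// Hypothetical event type mirroring the schema above (timestamp simplified to Long).
final case class Event(id: String, location: String, timestamp: Long, state: Int)

object LatestPerId {
  // Merge the previously stored event (if any) with a new batch for the same id
  // and keep the one with the greatest timestamp. On a tie the stored event wins.
  def newest(previous: Option[Event], batch: Iterator[Event]): Event =
    (previous.iterator ++ batch).maxBy(_.timestamp)

  // Fold a sequence of micro-batches into one latest event per id,
  // the way per-key state would evolve across triggers.
  def run(batches: Seq[Seq[Event]]): Map[String, Event] =
    batches.foldLeft(Map.empty[String, Event]) { (stateById, batch) =>
      batch.groupBy(_.id).foldLeft(stateById) { case (acc, (id, events)) =>
        acc.updated(id, newest(acc.get(id), events.iterator))
      }
    }

  // Made-up input: two micro-batches with repeated ids.
  val sampleBatches: Seq[Seq[Event]] = Seq(
    Seq(Event("abc", "a", 1L, 1), Event("def", "b", 1L, 2), Event("ghi", "b", 1L, 1)),
    Seq(Event("abc", "a", 2L, 2), Event("def", "b", 2L, 1))
  )
}
```

In a real streaming job, the body of `newest` would sit inside the function passed to mapGroupsWithState, with `previous` read from and written back to the group state.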

Finally, I apply a windowed groupBy to count the occurrences of each distinct state per location, which should give the following schema:

root
 |-- location: string (nullable = true)
 |-- state1: long (nullable = false)
 |-- state2: long (nullable = false)

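The query looks like this:

val aggDemand = df
  .select($"id", $"location", $"timestamp", $"state")
  .withWatermark("timestamp", "10 seconds")
  // sliding window: t15min seconds long, advancing every t1min seconds
  .groupBy(functions.window($"timestamp", DataConstant.t15min.toString + " seconds", DataConstant.t1min.toString + " seconds"), $"location")
  // count only the rows whose state matches; column comparison needs ===
  .agg(count(when($"state" === 1L, $"state")) as "state1", count(when($"state" === 2L, $"state")) as "state2")
  // the comparison operator between the two unix_timestamp calls is missing in the source
  .filter(unix_timestamp($"window.end".cast(TimestampType)) unix_timestamp(from_utc_timestamp(current_timestamp(), "UTC+08:00")))
  .drop($"window")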

When I run this against a streaming DataFrame/Dataset read from Kafka, I get the following error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: mapGroupsWithState is not supported with aggregation on a streaming DataFrame/Dataset;;

The goal is to get the following result:

location | state 1 | state 2
-----------------------------
    a    |    0    |    1
    b    |    2    |    0
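
As a sanity check on the table above, the per-location counting can be sketched in plain Scala over the three latest events listed earlier. This is only an illustration of the intended aggregation, not the streaming query. `LatestEvent`, `StateCounts` and `CountByLocation` are hypothetical names, and the time window is left out.

```scala
// Hypothetical types: one latest event per id, and one output row per location.
final case class LatestEvent(id: String, location: String, timestamp: Long, state: Int)
final case class StateCounts(location: String, state1: Long, state2: Long)

object CountByLocation {
  // Group the latest events by location and count how many are in state 1 and state 2.
  def count(events: Seq[LatestEvent]): Seq[StateCounts] =
    events.groupBy(_.location).toSeq.sortBy(_._1).map { case (location, rows) =>
      StateCounts(
        location,
        rows.count(_.state == 1).toLong,
        rows.count(_.state == 2).toLong
      )
    }

  // The three latest events from the question.
  val latestEvents: Seq[LatestEvent] = Seq(
    LatestEvent("abc", "a", 2L, 2),
    LatestEvent("def", "b", 2L, 1),
    LatestEvent("ghi", "b", 1L, 1)
  )
}
```

Calling `CountByLocation.count(CountByLocation.latestEvents)` yields one row for location a with counts (0, 1) and one for location b with counts (2, 0), which matches the table.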

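This approach works in batch mode but seems to fail for streaming queries. What is wrong with the query, and how can I get the desired result? Do I need to store the output of mapGroupsWithState somewhere before applying the window operation?

Thanks for your help.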

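Structured Streaming has many limitations. It is not a replacement for Spark Streaming.

In Spark Streaming you can get the same result with the mapWithState function.

See this check in the Spark source, which raises the error: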

        case m: FlatMapGroupsWithState if m.isStreaming =>

      // Check compatibility with output modes and aggregations in query
      val aggsAfterFlatMapGroups = collectStreamingAggregates(plan)

      if (m.isMapGroupsWithState) {                       // check mapGroupsWithState
        // allowed only in update query output mode and without aggregation
        if (aggsAfterFlatMapGroups.nonEmpty) {
          throwError(
            "mapGroupsWithState is not supported with aggregation " +
              "on a streaming DataFrame/Dataset")
        } else if (outputMode != InternalOutputModes.Update) {
          throwError(
            "mapGroupsWithState is not supported with " +
              s"$outputMode output mode on a streaming DataFrame/Dataset")
        }
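
The mapWithState route suggested above could look roughly like the following sketch, assuming the events have already been parsed into a keyed DStream. The `Event` case class, the `LatestWithMapWithState` object and the `latestPerId` helper are hypothetical names. `State`, `StateSpec.function` and `mapWithState` come from the Spark Streaming (DStream) API. The block needs Spark Streaming on the classpath and is not a complete job.

```scala
import java.sql.Timestamp
import org.apache.spark.streaming.{State, StateSpec}
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type mirroring the schema in the question.
final case class Event(id: String, location: String, timestamp: Timestamp, state: Int)

object LatestWithMapWithState {
  // Mapping function for mapWithState: compare the incoming event with the
  // stored one, keep the newer of the two as state, and emit it.
  val trackLatest: (String, Option[Event], State[Event]) => Option[Event] =
    (id, incoming, state) => {
      val latest = (state.getOption().iterator ++ incoming.iterator)
        .reduceOption((a, b) => if (b.timestamp.after(a.timestamp)) b else a)
      latest.foreach(e => state.update(e))
      latest
    }

  // Assumes `events` is keyed by id; returns the current latest event for
  // every id that received data in the batch.
  def latestPerId(events: DStream[(String, Event)]): DStream[Event] =
    events.mapWithState(StateSpec.function(trackLatest)).flatMap(_.toList)
}
```

mapWithState requires checkpointing to be enabled on the StreamingContext. The per-location counts would then be computed on the resulting DStream, for example with a windowed reduce, which is not shown here.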

Which Spark version are you using?