Error when using mapGroupsWithState in Spark Structured Streaming
When I apply a window operation to the result of mapGroupsWithState to get aggregated counts over several fields, I get an error. The input follows the schema below, where there can be many events with the same id but different timestamps and state values:
root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- state: int (nullable = true)
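For example:
Using mapGroupsWithState, I keep only the event with the latest timestamp for each id. The resulting schema is the same, but there are no duplicate ids, and each row holds the latest event: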
root
|-- id: string (nullable = true)
|-- location: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- state: int (nullable = true)
For the events above, the result is:
event("abc", "a", 2, 2)
event("def", "b", 2, 1)
event("ghi", "b", 1, 1)
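The "keep the latest event per id" step can be sketched in plain Scala, without Spark, to make the intended semantics concrete. The `Event` case class, the `KeepLatest` object, and the sample input used to exercise it are hypothetical (the original input events are not shown above). In a real mapGroupsWithState function the same comparison would presumably be made against `state.getOption` before calling `state.update`.

```scala
// Hypothetical case class mirroring the schema above; timestamps are plain
// integers here, as in the event(...) notation.
final case class Event(id: String, location: String, timestamp: Long, state: Int)

object KeepLatest {
  // Returns the most recent event among the stored one (if any) and the new
  // batch. Assumes at least one event is available; on ties the earlier
  // element (the stored event first) wins.
  def latest(previous: Option[Event], incoming: Iterator[Event]): Event =
    (previous.iterator ++ incoming).maxBy(_.timestamp)

  // Batch stand-in for the stateful operator: one latest event per id,
  // sorted by id for a stable result.
  def latestPerId(events: Seq[Event]): Seq[Event] =
    events.groupBy(_.id).toSeq
      .map { case (_, group) => latest(None, group.iterator) }
      .sortBy(_.id)
}
```

Given an assumed input containing two events each for "abc" and "def" and one for "ghi", `KeepLatest.latestPerId` returns one row per id, which is the shape of the result listed above.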
Finally, I apply a groupBy window operation to count the occurrences of each unique state per location, which gives the following schema:
root
|-- location: string (nullable = true)
|-- state1: long (nullable = false)
|-- state2: long (nullable = false)
The query looks like this:
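import org.apache.spark.sql.functions
import org.apache.spark.sql.functions.{count, current_timestamp, from_utc_timestamp, unix_timestamp, when}
import org.apache.spark.sql.types.TimestampType

val aggDemand = df
  .select($"id", $"location", $"timestamp", $"state")
  .withWatermark("timestamp", "10 seconds")
  .groupBy(functions.window($"timestamp", DataConstant.t15min.toString + " seconds", DataConstant.t1min.toString + " seconds"), $"location")
  .agg(count(when($"state" === 1L, $"state")) as "state1", count(when($"state" === 2L, $"state")) as "state2")
  .filter(unix_timestamp($"window.end".cast(TimestampType)) /* comparison operator missing in the source */ unix_timestamp(from_utc_timestamp(current_timestamp(), "UTC+08:00")))
  .drop($"window")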
When I run it against a streaming DataFrame/Dataset from Kafka, I get the following error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: mapGroupsWithState is not supported with aggregation on a streaming DataFrame/Dataset;;
The goal is to get the following result:
location | state 1 | state 2
-----------------------------
a | 0 | 1
b | 2 | 0
This approach works in batch mode, but it seems to fail for streaming queries.
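As a batch-mode illustration of the intended aggregation, the counts per location can be reproduced in plain Scala from the three deduplicated events listed earlier. This is only a sketch: `DedupedEvent`, `LocationCounts`, and `CountStates` are hypothetical names, and the time window is ignored for brevity.

```scala
// Hypothetical types: one deduplicated event per id, and one output row
// per location with the two conditional counts.
final case class DedupedEvent(id: String, location: String, timestamp: Long, state: Int)
final case class LocationCounts(location: String, state1: Long, state2: Long)

object CountStates {
  // Mirrors count(when(state === 1, ...)) and count(when(state === 2, ...))
  // grouped by location; the window column is left out of this sketch.
  def perLocation(events: Seq[DedupedEvent]): Seq[LocationCounts] =
    events.groupBy(_.location).toSeq
      .map { case (location, group) =>
        LocationCounts(
          location,
          group.count(_.state == 1).toLong,
          group.count(_.state == 2).toLong)
      }
      .sortBy(_.location)

  def main(args: Array[String]): Unit = {
    val deduped = Seq(
      DedupedEvent("abc", "a", 2, 2),
      DedupedEvent("def", "b", 2, 1),
      DedupedEvent("ghi", "b", 1, 1)
    )
    perLocation(deduped).foreach(println)
  }
}
```

For these three events the sketch yields 0 and 1 for location a and 2 and 0 for location b, which agrees with the table above.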
What is wrong with the query, and how can I get the desired result? Do I need to store the result of mapGroupsWithState before running the window operation?
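Thanks for your help.

Structured Streaming has many restrictions. It is not a replacement for Spark Streaming. In Spark Streaming, you can get the same result with the mapWithState function. See this link; the check that raises the error is: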
case m: FlatMapGroupsWithState if m.isStreaming =>
  // Check compatibility with output modes and aggregations in query
  val aggsAfterFlatMapGroups = collectStreamingAggregates(plan)
  if (m.isMapGroupsWithState) { // check mapGroupsWithState
    // allowed only in update query output mode and without aggregation
    if (aggsAfterFlatMapGroups.nonEmpty) {
      throwError(
        "mapGroupsWithState is not supported with aggregation " +
          "on a streaming DataFrame/Dataset")
    } else if (outputMode != InternalOutputModes.Update) {
      throwError(
        "mapGroupsWithState is not supported with " +
          s"$outputMode output mode on a streaming DataFrame/Dataset")
    }
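To see why the query in the question trips this check, the quoted branch can be modeled as a small pure function. This is a simplified, hypothetical model for illustration (the `Mode` hierarchy and `StateCheck` object are invented here and are not Spark's API); it only reproduces the two conditions visible in the excerpt.

```scala
// Hypothetical stand-ins for the query output modes.
sealed trait Mode
case object Append extends Mode
case object Update extends Mode
case object Complete extends Mode

object StateCheck {
  // Returns the error message the quoted branch would raise, or None when
  // the branch accepts the plan. Models only the mapGroupsWithState case.
  def mapGroupsWithStateError(hasAggregationAfter: Boolean, outputMode: Mode): Option[String] =
    if (hasAggregationAfter)
      Some("mapGroupsWithState is not supported with aggregation " +
        "on a streaming DataFrame/Dataset")
    else if (outputMode != Update)
      Some("mapGroupsWithState is not supported with " +
        s"$outputMode output mode on a streaming DataFrame/Dataset")
    else None
}
```

In this model, a windowed groupBy after mapGroupsWithState corresponds to `hasAggregationAfter = true`, so the first message is produced regardless of output mode. The Structured Streaming programming guide lists flatMapGroupsWithState in Append operation mode as a variant that permits aggregations afterwards, which may be worth checking against the Spark version in use.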
Which Spark version are you using?