Apache Spark: using a Dataset inside "mapGroupsWithState" in Spark SQL


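I have events whose data has the shape "id and Map[String, List]". I group these events by id and then compute something with mapGroupsWithState.

Can I use the from_json() function inside mapGroupsWithState? And can I use a Dataset/DataFrame inside mapGroupsWithState? For example: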

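df.groupByKey(_.id).mapGroupsWithState { (id, events, state) =>
   // events is an Iterator over the rows of one group; is this allowed?
   val anotherDF = events.toDF
   ... other operations...
}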

Can I use the from_json() function inside mapGroupsWithState? And can I use a Dataset/DataFrame inside mapGroupsWithState?

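Answer: no to both (loosely speaking), at least not in the standard way.

Once you are inside mapGroupsWithState you are operating at the executor level, where you write custom code against plain Scala objects without the DataFrame abstraction. from_json() builds a Column expression, so it only has meaning inside DataFrame operations such as select, and the SparkSession needed to create a new DataFrame is not available on the executors.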
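One way to work within this restriction is to apply from_json() before grouping, while the data is still a DataFrame, and keep only plain Scala code inside the state function. The sketch below is a minimal illustration under assumptions, not the asker's actual code: the Event and Summary case classes, the attrs field name, the "value" input column and the counting logic are all hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
import org.apache.spark.sql.types.{ArrayType, MapType, StringType, StructType}

// Hypothetical event and state types; the question only says "id and Map[String, List]".
case class Event(id: String, attrs: Map[String, Seq[String]])
case class Summary(id: String, valueCount: Long)

object ParseBeforeGrouping {
  // Pure Scala logic: adds the number of list values in this batch to the previous count.
  def merge(previous: Option[Summary], id: String, events: Iterator[Event]): Summary = {
    val added = events.map(_.attrs.values.map(_.size).sum.toLong).sum
    Summary(id, previous.map(_.valueCount).getOrElse(0L) + added)
  }

  // State function passed to mapGroupsWithState; it runs on the executors
  // and touches only plain Scala values and the GroupState handle.
  def update(id: String, events: Iterator[Event], state: GroupState[Summary]): Summary = {
    val next = merge(state.getOption, id, events)
    state.update(next)
    next
  }

  // Assumes `raw` is a streaming DataFrame with a string column named "value" holding JSON.
  def summarize(raw: DataFrame): Dataset[Summary] = {
    import raw.sparkSession.implicits._

    val schema = new StructType()
      .add("id", StringType)
      .add("attrs", MapType(StringType, ArrayType(StringType)))

    // from_json is used here, on the driver-side DataFrame plan, before grouping.
    val events = raw
      .select(from_json(col("value"), schema).as("e"))
      .select("e.*")
      .where(col("id").isNotNull && col("attrs").isNotNull)
      .as[Event]

    events
      .groupByKey(_.id)
      .mapGroupsWithState[Summary, Summary](GroupStateTimeout.NoTimeout)(update _)
  }
}
```

Keeping the arithmetic in merge, separate from the GroupState handling, also makes the per-group logic testable without starting a streaming query.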

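I would also like to ask one more question. When I use mapGroupsWithState without a timeout (and in update output mode), does the state grow indefinitely? For example, I have 100 users and keep 100 states, one per user. Each state is updated with every record received. In that case, will all of the states grow indefinitely?

Without a timeout, the updates to each group's state should be saved each time the streaming query is invoked. I have not tried this myself.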
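For context, with GroupStateTimeout.NoTimeout Spark keeps the state of a key until the state function calls state.remove(), so the number of stored states follows the number of distinct keys, and the size of each state depends on what the function chooses to store. The following is a hedged sketch of expiring idle keys with a processing-time timeout. The UserEvent and UserTotal types, the running-total logic and the one-hour duration are assumptions made for illustration.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

// Hypothetical input and output types, not taken from the question.
case class UserEvent(id: String, value: Long)
case class UserTotal(id: String, total: Long, expired: Boolean)

object ExpiringState {
  // Pure Scala logic: the state is a single running total, so its size stays constant per key.
  def add(previous: Option[Long], events: Iterator[UserEvent]): Long =
    previous.getOrElse(0L) + events.map(_.value).sum

  def update(id: String, events: Iterator[UserEvent], state: GroupState[Long]): UserTotal =
    if (state.hasTimedOut) {
      // No data arrived for this key within the timeout: emit the last total and drop the state.
      val last = state.getOption.getOrElse(0L)
      state.remove()
      UserTotal(id, last, expired = true)
    } else {
      val total = add(state.getOption, events)
      state.update(total)
      // Assumed idle period; the timeout is reset every time the key receives data.
      state.setTimeoutDuration("1 hour")
      UserTotal(id, total, expired = false)
    }

  // Assumes `events` is a streaming Dataset of UserEvent.
  def totals(events: Dataset[UserEvent]): Dataset[UserTotal] = {
    import events.sparkSession.implicits._
    events
      .groupByKey(_.id)
      .mapGroupsWithState[Long, UserTotal](GroupStateTimeout.ProcessingTimeTimeout)(update _)
  }
}
```

In the 100-user example this would hold at most 100 small states, and a user's state disappears after an hour without events. If the state were instead a collection that every event appends to, each state would keep growing regardless of the timeout setting.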