Java: count distinct values while aggregating other data?


java apache-spark streaming


Here is what my dataset looks like:

+---------+------------+-----------------+
|  name   |request_type| request_group_id|
+---------+------------+-----------------+
|Michael  |     X      |  1020           |
|Michael  |     X      |  1018           |
|Joe      |     Y      |  1018           |
|Sam      |     X      |  1018           |
|Michael  |     Y      |  1021           |
|Sam      |     X      |  1030           |
|Elizabeth|     Y      |  1035           |
+---------+------------+-----------------+

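For each person, I want to count the occurrences of each request_type and also count the distinct request_group_id values. The result should look like this: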

+---------+--------------------+---------------------+--------------------------------+
|  name   |cnt(request_type(X))| cnt(request_type(Y))| cnt(distinct(request_group_id))|
+---------+--------------------+---------------------+--------------------------------+
|Michael  |          2         |         1           |      3                         |
|Joe      |          0         |         1           |      1                         |
|Sam      |          2         |         0           |      2                         |
|John     |          1         |         0           |      1                         |
|Elizabeth|          0         |         1           |      1                         |
+---------+--------------------+---------------------+--------------------------------+

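What I have done so far (this produces the first two count columns):

// Count rows per name, with one output column per request type (X, Y).
msgDataFrame.select(NAME, REQUEST_TYPE)
            .groupBy(NAME)
            .pivot(REQUEST_TYPE, Lists.newArrayList(X, Y))
            .agg(functions.count(REQUEST_TYPE))
            .show();

How can I count the distinct request_group_id values within this same select? Is it possible to do it there at all?

I think it is only possible by joining two datasets (my current result plus a separate aggregation over distinct request_group_id).

functions.countDistinct?

@pasha701 If we add this function to the current aggregation, it will count the unique group ids per name and request type, so simply summing those counts afterwards will not solve the original problem. I want this distinct group id count to be done per name... Please correct me if I am wrong.

Example with "countDistinct" ("countDistinct" does not work over a window, so it is replaced with "size" and "collect_set"):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_set, count, size}
// The $"..." column syntax assumes spark.implicits._ is in scope.

// Attach the per-name distinct group id count to every row,
// then carry it through the pivot as an extra grouping column.
val groupIdWindow = Window.partitionBy("name")
df.select($"name", $"request_type",
      size(collect_set("request_group_id").over(groupIdWindow)).alias("countDistinct"))
  .groupBy("name", "countDistinct")
  .pivot($"request_type", Seq("X", "Y"))
  .agg(count("request_type"))
  .show(false)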
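
As a way to sanity-check whatever Spark query ends up being used, the same per-name aggregation can be sketched in plain Java without Spark. This is only an in-memory reference under stated assumptions: the class `RequestStats`, its nested `Row` and `Stats` types, and the `aggregate` method are hypothetical names introduced here, not part of the original post or of the Spark API. The per-name set of group ids plays the role that size(collect_set(...)) plays in the window example.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class RequestStats {

    // One input row: (name, request_type, request_group_id). Hypothetical helper type.
    static final class Row {
        final String name;
        final String requestType;
        final int requestGroupId;

        Row(String name, String requestType, int requestGroupId) {
            this.name = name;
            this.requestType = requestType;
            this.requestGroupId = requestGroupId;
        }
    }

    // Running aggregate for a single name. Hypothetical helper type.
    static final class Stats {
        long countX;
        long countY;
        // Set of group ids seen for this name (analogous to collect_set).
        final Set<Integer> groupIds = new TreeSet<>();

        // Number of distinct group ids (analogous to size(collect_set(...))).
        int distinctGroupIds() {
            return groupIds.size();
        }
    }

    // Single pass over the rows: conditional counts plus a set per name.
    static Map<String, Stats> aggregate(List<Row> rows) {
        Map<String, Stats> byName = new LinkedHashMap<>();
        for (Row row : rows) {
            Stats stats = byName.computeIfAbsent(row.name, k -> new Stats());
            if ("X".equals(row.requestType)) {
                stats.countX++;
            }
            if ("Y".equals(row.requestType)) {
                stats.countY++;
            }
            stats.groupIds.add(row.requestGroupId);
        }
        return byName;
    }

    public static void main(String[] args) {
        // The sample rows from the question.
        List<Row> rows = Arrays.asList(
            new Row("Michael", "X", 1020),
            new Row("Michael", "X", 1018),
            new Row("Joe", "Y", 1018),
            new Row("Sam", "X", 1018),
            new Row("Michael", "Y", 1021),
            new Row("Sam", "X", 1030),
            new Row("Elizabeth", "Y", 1035));

        aggregate(rows).forEach((name, s) ->
            System.out.printf("%-9s X=%d Y=%d distinct=%d%n",
                name, s.countX, s.countY, s.distinctGroupIds()));
    }
}
```

For the seven sample rows this yields the counts shown in the expected table for Michael, Joe, Sam and Elizabeth. John has no rows in the sample input, so he does not appear in the output.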