Spark Scala: counting occurrences of a column value within a group using an aggregate function


scala apache-spark


I have the following data:

group_id    id  name
----        --  ----
G1          1   apple
G1          2   orange
G1          3   apple
G1          4   banana
G1          5   apple
G2          6   orange
G2          7   apple
G2          8   apple

I want to find how many times each name occurs within its group. So far I have tried this:

val group = Window.partitionBy("group_id")
newdf.withColumn("name_appeared_count", approx_count_distinct($"name").over(group))

I want a result like this:

group_id    id  name    name_appeared_count
----        --  ----    -------------------
G1          1   apple   3
G1          2   orange  1
G1          3   apple   3
G1          4   banana  1
G1          5   apple   3
G2          6   orange  1
G2          7   apple   2
G2          8   apple   2

Thanks in advance.

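The expression approx_count_distinct($"name").over(group) counts the distinct name values in each group, so it does not produce the expected output. Applying count("name") over a window defined with partitionBy("group_id", "name") generates the required counts: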

import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

val df = Seq(
  ("G1", 1, "apple"),
  ("G1", 2, "orange"),
  ("G1", 3, "apple"),
  ("G1", 4, "banana"),
  ("G1", 5, "apple"),
  ("G2", 6, "orange"),
  ("G2", 7, "apple"),
  ("G2", 8, "apple")
).toDF("group_id", "id", "name")

val group = Window.partitionBy("group_id", "name")

df.
  withColumn("name_appeared_count", count("name").over(group)).
  orderBy("id").
  show
// +--------+---+------+-------------------+
// |group_id| id|  name|name_appeared_count|
// +--------+---+------+-------------------+
// |      G1|  1| apple|                  3|
// |      G1|  2|orange|                  1|
// |      G1|  3| apple|                  3|
// |      G1|  4|banana|                  1|
// |      G1|  5| apple|                  3|
// |      G2|  6|orange|                  1|
// |      G2|  7| apple|                  2|
// |      G2|  8| apple|                  2|
// +--------+---+------+-------------------+
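
To make the difference between the two window definitions concrete, here is a minimal sketch that puts both columns side by side. It assumes a local SparkSession created in the script itself (in spark-shell, spark already exists and that line can be skipped); the names byGroup, byGroupAndName, distinct_names_in_group and compared are made up for illustration and do not appear in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

// Assumption: a local session for experimentation; spark-shell provides `spark` already.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("window-count-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  ("G1", 1, "apple"),
  ("G1", 2, "orange"),
  ("G1", 3, "apple"),
  ("G1", 4, "banana"),
  ("G1", 5, "apple"),
  ("G2", 6, "orange"),
  ("G2", 7, "apple"),
  ("G2", 8, "apple")
).toDF("group_id", "id", "name")

// Window from the question: one partition per group_id.
val byGroup = Window.partitionBy("group_id")
// Window from the answer: one partition per (group_id, name) pair.
val byGroupAndName = Window.partitionBy("group_id", "name")

val compared = df
  // Approximate number of distinct names in the whole group (same value on every row of a group).
  .withColumn("distinct_names_in_group", approx_count_distinct($"name").over(byGroup))
  // Number of rows sharing this row's group_id and name.
  .withColumn("name_appeared_count", count("name").over(byGroupAndName))
  .orderBy("id")

compared.show()
```

The first column repeats a single per-group value on every row, which is the behavior the asker describes for G1, while the second varies with the name and matches the expected output. Note that count("name") skips rows where name is null; count(lit(1)) would be one way to count those rows as well.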

What problem did you run into with your approach?

It gives me the overall distinct count. For example, for G1 I get 3 for all 5 records, because there are 3 distinct items, which is correct according to the documentation. But I don't know how to get what I want.

You can partition by multiple columns, as you need here.

Great! Such a simple thing! Thank you very much, Leo C!