Spark SQL - Get the row count per window using Spark SQL window functions

Tags: apache-spark, apache-spark-sql

I want to do some aggregation and windowing using Spark SQL window functions.

Suppose that I am working with the sample table provided here:

I want to run a query that gives the top 2 revenues for each category, together with the count of products in each category.

After I run this query:

SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
the count column comes back as a running count within each window, instead of

product category revenue count
pro2    tablet   6500    5
mini    tablet   5500    5
which is what I expected.

How should I write the query to get the correct count for each category, without using a separate GROUP BY statement?

In Spark, when a window specification includes an ORDER BY clause, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so count(*) only counts rows up to and including the current row.

For your case, add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the count(*) window clause so that the frame covers the whole partition.
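For reference, here is a minimal sketch of the same fix using the DataFrame API instead of raw SQL. The sample rows are illustrative stand-ins for the productRevenue table, and the code assumes a spark-shell session (where spark.implicits._ is already in scope):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, dense_rank}

// Illustrative rows; substitute the real productRevenue data.
val productRevenue = Seq(
  ("Pro2", "tablet", 6500), ("Mini", "tablet", 5500),
  ("Thin", "cell phone", 6000), ("Very thin", "cell phone", 6000),
  ("Ultra thin", "cell phone", 5000)
).toDF("product", "category", "revenue")

// Ranking window: the default frame is fine for dense_rank().
val rankWindow = Window.partitionBy("category").orderBy(col("revenue").desc)

// Counting window: widen the frame to the whole partition; otherwise
// count(*) only counts from the start of the frame to the current row.
val countWindow = rankWindow
  .rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)

productRevenue
  .withColumn("rank", dense_rank().over(rankWindow))
  .withColumn("count", count("*").over(countWindow))
  .where(col("rank") <= 2)
  .show(false)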

Alternatively:

Change count(*) OVER (PARTITION BY category ORDER BY revenue DESC) to count(*) OVER (PARTITION BY category ORDER BY category DESC) as count and you will get the expected result. This works because every row in a partition shares the same category value, so under the default RANGE frame all rows are peers of the current row and the count covers the whole partition.

Try the code below:

scala> spark.sql("""SELECT
     |   product,
     |   category,
     |   revenue,
     |   rank,
     |   count
     | FROM (
     |   SELECT
     |     product,
     |     category,
     |     revenue,
     |     dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
     |     count(*) OVER (PARTITION BY category ORDER BY category DESC) as count
     |   FROM productRevenue) tmp
     | WHERE
     |   tmp.rank <= 2 """).show(false)

+----------+----------+-------+----+-----+
|product   |category  |revenue|rank|count|
+----------+----------+-------+----+-----+
|Pro2      |tablet    |6500   |1   |5    |
|Mini      |tablet    |5500   |2   |5    |
|Thin      |cell phone|6000   |1   |5    |
|Very thin |cell phone|6000   |1   |5    |
|Ultra thin|cell phone|5000   |2   |5    |
+----------+----------+-------+----+-----+  

And here is the query with the explicit ROWS BETWEEN frame, per the first suggestion:

SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
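A shorter variant of the same idea: a window specification with a PARTITION BY but no ORDER BY already defaults to the whole partition, so you can simply drop the ORDER BY from the count window. A sketch, again against the productRevenue table:

spark.sql("""SELECT
  product,
  category,
  revenue,
  rank,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2""").show(false)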