Apache Spark / Spark SQL: getting the row count of each window with Spark SQL window functions
I want to use Spark SQL window functions to do some aggregation and windowing. Suppose I use the sample table productRevenue. I want to run a query that returns the top 2 revenues for each category, together with the product count for each category. This is the query I ran:
SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
The count column in the result is not the per-category total. What I expected is:
product  category  revenue  count
pro2     tablet    6500     5
mini     tablet    5500     5
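How should I write the query to get the correct count for each category, without using another separate GROUP BY statement?
In Spark, when a window clause has an ORDER BY, the frame defaults to RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, so count(*) becomes a running count up to the current row (and its ties) rather than a count of the whole partition.
For your case, add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the window clause of count(*).
Try the query with that frame added; its full text is the standalone SELECT further down.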
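The default-frame behaviour described above can be checked without a Spark cluster. The following is a minimal sketch, not Spark code: it uses Python's built-in sqlite3 module as a stand-in (SQLite 3.25 or later is needed for window functions), on the assumption that its default frame matches the one described for Spark. The rows in SAMPLE_ROWS are assumed values for productRevenue, since the post does not reproduce the table; frame_demo, running_count and full_count are hypothetical names.

```python
import sqlite3

# Assumed sample rows (illustrative only; the post does not list the table).
SAMPLE_ROWS = [
    ("Thin", "cell phone", 6000), ("Normal", "tablet", 1500),
    ("Mini", "tablet", 5500), ("Ultra thin", "cell phone", 5000),
    ("Very thin", "cell phone", 6000), ("Big", "tablet", 2500),
    ("Bendable", "cell phone", 3000), ("Foldable", "cell phone", 3000),
    ("Pro", "tablet", 4500), ("Pro2", "tablet", 6500),
]

# running_count relies on the default frame that ORDER BY implies;
# full_count spells out a frame that spans the whole partition.
FRAME_QUERY = """
SELECT product, revenue,
       count(*) OVER (PARTITION BY category ORDER BY revenue DESC) AS running_count,
       count(*) OVER (PARTITION BY category ORDER BY revenue DESC
                      ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS full_count
FROM productRevenue
WHERE category = 'tablet'
ORDER BY revenue DESC
"""


def frame_demo():
    """Return (product, revenue, running_count, full_count) for the tablet rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE productRevenue (product TEXT, category TEXT, revenue INTEGER)"
        )
        conn.executemany("INSERT INTO productRevenue VALUES (?, ?, ?)", SAMPLE_ROWS)
        return conn.execute(FRAME_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in frame_demo():
        print(row)
```

With these assumed rows, running_count climbs from 1 to 5 down the tablet partition while full_count stays at 5, which is the difference the explicit frame is meant to remove.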
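Alternatively, change count(*) OVER (PARTITION BY category ORDER BY revenue DESC) to count(*) OVER (PARTITION BY category ORDER BY category DESC). Every row in a partition then has the same sort key, so the default frame covers the whole partition and you get the expected result.
Try the code below: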
scala> spark.sql("""SELECT
     |   product,
     |   category,
     |   revenue,
     |   rank,
     |   count
     | FROM (
     |   SELECT
     |     product,
     |     category,
     |     revenue,
     |     dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
     |     count(*) OVER (PARTITION BY category ORDER BY category DESC) as count
     |   FROM productRevenue) tmp
     | WHERE
     |   tmp.rank <= 2 """).show(false)
+----------+----------+-------+----+-----+
|product   |category  |revenue|rank|count|
+----------+----------+-------+----+-----+
|Pro2      |tablet    |6500   |1   |5    |
|Mini      |tablet    |5500   |2   |5    |
|Thin      |cell phone|6000   |1   |5    |
|Very thin |cell phone|6000   |1   |5    |
|Ultra thin|cell phone|5000   |2   |5    |
+----------+----------+-------+----+-----+
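Why ordering by the partition key works can be sketched as follows. This is an illustration under assumptions, not Spark code: it runs on Python's built-in sqlite3 (SQLite 3.25 or later) as a stand-in, and SAMPLE_ROWS holds assumed values for productRevenue. The names peers_demo, range_default, rows_frame and no_order are hypothetical. The sketch compares the default frame, an explicit ROWS frame ending at the current row, and a window with no ORDER BY at all.

```python
import sqlite3

# Assumed sample rows (illustrative only; the post does not list the table).
SAMPLE_ROWS = [
    ("Thin", "cell phone", 6000), ("Normal", "tablet", 1500),
    ("Mini", "tablet", 5500), ("Ultra thin", "cell phone", 5000),
    ("Very thin", "cell phone", 6000), ("Big", "tablet", 2500),
    ("Bendable", "cell phone", 3000), ("Foldable", "cell phone", 3000),
    ("Pro", "tablet", 4500), ("Pro2", "tablet", 6500),
]

# Ordering by the partition key makes every row in the partition a tie (peer).
# A RANGE frame ending at the current row includes all of its peers,
# whereas a ROWS frame ending at the current row does not.
PEERS_QUERY = """
SELECT product,
       count(*) OVER (PARTITION BY category ORDER BY category DESC) AS range_default,
       count(*) OVER (PARTITION BY category ORDER BY category DESC
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rows_frame,
       count(*) OVER (PARTITION BY category) AS no_order
FROM productRevenue
WHERE category = 'tablet'
"""


def peers_demo():
    """Return (product, range_default, rows_frame, no_order) for the tablet rows."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE productRevenue (product TEXT, category TEXT, revenue INTEGER)"
        )
        conn.executemany("INSERT INTO productRevenue VALUES (?, ?, ?)", SAMPLE_ROWS)
        return conn.execute(PEERS_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in peers_demo():
        print(row)
```

In this sketch range_default and no_order are 5 on every tablet row, while rows_frame takes the values 1 through 5 in whatever order the engine visits the tied rows. Dropping the ORDER BY from the count(*) window is therefore a simpler way to express the same intent as ordering by the partition key.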
The query with the explicit frame on count(*) (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):
SELECT
  product,
  category,
  revenue,
  count
FROM (
  SELECT
    product,
    category,
    revenue,
    dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) as rank,
    count(*) OVER (PARTITION BY category ORDER BY revenue DESC ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as count
  FROM productRevenue) tmp
WHERE
  rank <= 2
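The full query above can be exercised end to end in a small sketch. It is not Spark code: Python's built-in sqlite3 (SQLite 3.25 or later) stands in for Spark SQL, SAMPLE_ROWS holds assumed values for productRevenue, and the aliases rnk and cnt replace rank and count only to steer clear of names that double as function names. The helper top_two_with_count and the final ORDER BY (added to make the output order deterministic) are also not part of the original query.

```python
import sqlite3

# Assumed sample rows (illustrative only; the post does not list the table).
SAMPLE_ROWS = [
    ("Thin", "cell phone", 6000), ("Normal", "tablet", 1500),
    ("Mini", "tablet", 5500), ("Ultra thin", "cell phone", 5000),
    ("Very thin", "cell phone", 6000), ("Big", "tablet", 2500),
    ("Bendable", "cell phone", 3000), ("Foldable", "cell phone", 3000),
    ("Pro", "tablet", 4500), ("Pro2", "tablet", 6500),
]

# Same shape as the query above: dense_rank picks the top 2 revenue levels,
# and count(*) uses an explicit frame that spans the whole partition.
TOP_TWO_QUERY = """
SELECT product, category, revenue, cnt
FROM (
  SELECT product, category, revenue,
         dense_rank() OVER (PARTITION BY category ORDER BY revenue DESC) AS rnk,
         count(*) OVER (PARTITION BY category ORDER BY revenue DESC
                        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS cnt
  FROM productRevenue
) tmp
WHERE rnk <= 2
ORDER BY category, revenue DESC, product
"""


def top_two_with_count():
    """Return the top-2 revenue rows per category with the per-category row count."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE productRevenue (product TEXT, category TEXT, revenue INTEGER)"
        )
        conn.executemany("INSERT INTO productRevenue VALUES (?, ?, ?)", SAMPLE_ROWS)
        return conn.execute(TOP_TWO_QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in top_two_with_count():
        print(row)
```

Because dense_rank assigns tied revenues the same rank, the cell phone category contributes three rows here (two products share the top revenue), and every returned row carries the full category count of 5.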