How can I group by specific SQL columns and retrieve the rows with the highest count for those columns?
I have the following data:
col_1 | col_2 | col_3 | col_4
-----------------------------
a1 b1 c1 d1
a1 b2 c1 d1
a1 b3 c1 d1
a1 b4 c1 d2
a1 b5 c2 d2
a1 b6 c2 d2
a1 b7 c1 d3
a1 b8 c2 d3
a1 b9 c3 d3
a1 b10 c1 d2
a1 b11 c2 d3
a2 b12 c1 d1
a3 b13 c1 d1
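I'm interested in being able to:
- Return rows in which the value of col_1 is unique.
- For each row in the result, return the column values that have the highest count when grouping by col_1, col_3 and col_4.
Expected output: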
col_1 | col_2 | col_3 | col_4
-----------------------------
a1 b1 c1 d1
a2 b12 c1 d1
a3 b13 c1 d1
Note that each value in col_1 is unique. Also note that for a1 it returns c1 and d1, because that combination has the highest count within the a1 group.
How can I achieve this with a SQL query? I will be using it in a Hive SQL query.

You can use aggregation and window functions:
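-- Caveat: col_2 is in the group by, and col_2 is unique per row in the sample data,
-- so every group has count 1 and rank() ties all rows of a col_1 at rn = 1.
select col_1, col_2, col_3, col_4
from (
    select
        col_1,
        col_2,
        col_3,
        col_4,
        rank() over(partition by col_1 order by count(*) desc) rn
    from mytable t
    group by col_1, col_2, col_3, col_4
) t
where rn = 1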
Use the ROW_NUMBER() window function:
select t.col_1, t.col_2, t.col_3, t.col_4
from (
    select col_1, min(col_2) col_2, col_3, col_4,
           row_number() over (partition by col_1 order by count(*) desc) rn
    from tablename
    group by col_1, col_3, col_4
) t
where t.rn = 1
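As a rough way to check this approach against the sample data, here is a minimal sketch that loads the question's rows into an in-memory SQLite database from Python and runs a two-step variant of the query (aggregate in a CTE first, then apply row_number() to the counts). SQLite stands in for Hive here purely for convenience, and the helper name top_group_per_col_1 and the CTE name counts are hypothetical. It assumes a SQLite build with window function support (3.25 or later).

```python
import sqlite3

# Sample rows copied from the question's table.
ROWS = [
    ("a1", "b1", "c1", "d1"), ("a1", "b2", "c1", "d1"), ("a1", "b3", "c1", "d1"),
    ("a1", "b4", "c1", "d2"), ("a1", "b5", "c2", "d2"), ("a1", "b6", "c2", "d2"),
    ("a1", "b7", "c1", "d3"), ("a1", "b8", "c2", "d3"), ("a1", "b9", "c3", "d3"),
    ("a1", "b10", "c1", "d2"), ("a1", "b11", "c2", "d3"),
    ("a2", "b12", "c1", "d1"), ("a3", "b13", "c1", "d1"),
]

# Two-step form of the query: count each (col_1, col_3, col_4) group,
# then number the groups within each col_1 by descending count.
QUERY = """
with counts as (
    select col_1, min(col_2) as col_2, col_3, col_4, count(*) as cnt
    from tablename
    group by col_1, col_3, col_4
)
select col_1, col_2, col_3, col_4
from (
    select counts.*,
           row_number() over (partition by col_1 order by cnt desc) as rn
    from counts
) ranked
where rn = 1
order by col_1
"""


def top_group_per_col_1(rows):
    """Return one row per col_1: the most frequent (col_3, col_4) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table tablename (col_1 text, col_2 text, col_3 text, col_4 text)"
    )
    conn.executemany("insert into tablename values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_group_per_col_1(ROWS):
        print(row)
```

On the sample data this yields one row each for a1, a2 and a3. Because row_number() breaks ties arbitrarily, a col_1 with two equally frequent groups would return only one of them, and which one is not guaranteed.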
If you want the complete rows, you can use window functions:
select t.*
from (select t.*,
             rank() over (partition by col1 order by cnt desc) as seqnum
      from (select t.*, count(*) over (partition by col1, col3, col4) as cnt
            from t
           ) t
     ) t
where seqnum = 1;
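The innermost subquery counts the rows for each col1/col3/col4 combination. The middle subquery ranks the rows within each col1 by that count, highest first. The outermost query keeps only the rows with the highest count.

You have tagged your question with several different systems. Please retag it with the one you are actually using.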
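To see what the three nested levels produce, the following sketch runs the same window-function logic on the question's sample data in an in-memory SQLite database. SQLite is only a stand-in for Hive, the table is named t with columns col1 to col4 as in this answer, and the aliases counted and ranked and the helper full_rows_of_top_group are hypothetical. It assumes SQLite 3.25 or later for window functions.

```python
import sqlite3

# Sample rows copied from the question's table.
ROWS = [
    ("a1", "b1", "c1", "d1"), ("a1", "b2", "c1", "d1"), ("a1", "b3", "c1", "d1"),
    ("a1", "b4", "c1", "d2"), ("a1", "b5", "c2", "d2"), ("a1", "b6", "c2", "d2"),
    ("a1", "b7", "c1", "d3"), ("a1", "b8", "c2", "d3"), ("a1", "b9", "c3", "d3"),
    ("a1", "b10", "c1", "d2"), ("a1", "b11", "c2", "d3"),
    ("a2", "b12", "c1", "d1"), ("a3", "b13", "c1", "d1"),
]

QUERY = """
select col1, col2, col3, col4, cnt
from (
    -- middle level: rank rows within each col1 by their group's count
    select counted.*,
           rank() over (partition by col1 order by cnt desc) as seqnum
    from (
        -- innermost level: attach the col1/col3/col4 group size to every row
        select t.*,
               count(*) over (partition by col1, col3, col4) as cnt
        from t
    ) counted
) ranked
where seqnum = 1          -- outermost level: keep the highest count only
order by col1, col2
"""


def full_rows_of_top_group(rows):
    """Return every original row that belongs to the largest group per col1."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table t (col1 text, col2 text, col3 text, col4 text)")
    conn.executemany("insert into t values (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in full_rows_of_top_group(ROWS):
        print(row)
```

Since rank() assigns the same value to ties and every row carries its group's count, all three a1 rows of the c1/d1 group (b1, b2, b3) come back, so this variant returns complete rows rather than exactly one row per col1.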