How to group by specific SQL columns and retrieve the rows which have the highest count of those columns?

sql apache-spark hive


I have the following data:

col_1 | col_2 | col_3 | col_4
-----------------------------
a1      b1      c1      d1
a1      b2      c1      d1
a1      b3      c1      d1
a1      b4      c1      d2
a1      b5      c2      d2
a1      b6      c2      d2
a1      b7      c1      d3
a1      b8      c2      d3
a1      b9      c3      d3
a1      b10     c1      d2
a1      b11     c2      d3
a2      b12     c1      d1
a3      b13     c1      d1

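I would like to be able to:

Return rows in which each value of `col_1` appears only once.

For each row in the result, return the column values that have the highest count when grouping by `col_3`, `col_4`.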

For example, I would like the output to return the following:

col_1 | col_2 | col_3 | col_4
-----------------------------
a1      b1      c1      d1
a2      b12     c1      d1
a3      b13     c1      d1

Note that each value in `col_1` is unique. Also note that for `a1` it returns `c1` and `d1`, because that combination has the highest count within the `a1` group.

How can I achieve this with a SQL query? I will be using it in a Hive SQL query.
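To experiment with the answers, the sample data can be loaded into a table first. This is only a sketch: the table name `mytable` is borrowed from the first answer (the other answers call the table `tablename` and `t`), the `string` column type is an assumption, and the multi-row `insert ... values` form is assumed to be available (Hive 0.14 or later, or Spark SQL).

```sql
-- Assumed schema: all four columns stored as strings.
create table mytable (
    col_1 string,
    col_2 string,
    col_3 string,
    col_4 string
);

-- Sample rows copied from the question.
insert into table mytable values
    ('a1', 'b1',  'c1', 'd1'),
    ('a1', 'b2',  'c1', 'd1'),
    ('a1', 'b3',  'c1', 'd1'),
    ('a1', 'b4',  'c1', 'd2'),
    ('a1', 'b5',  'c2', 'd2'),
    ('a1', 'b6',  'c2', 'd2'),
    ('a1', 'b7',  'c1', 'd3'),
    ('a1', 'b8',  'c2', 'd3'),
    ('a1', 'b9',  'c3', 'd3'),
    ('a1', 'b10', 'c1', 'd2'),
    ('a1', 'b11', 'c2', 'd3'),
    ('a2', 'b12', 'c1', 'd1'),
    ('a3', 'b13', 'c1', 'd1');
```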

You can use aggregation and window functions:

select col_1, col_2, col_3, col_4
from (
    select
        col_1, 
        col_2, 
        col_3, 
        col_4, 
        rank() over(partition by col_1 order by count(*) desc) rn
    from mytable t
    group by col_1, col_2, col_3, col_4
) t
where rn = 1
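Before relying on a ranking query, it may help to look at the counts it ranks. Note that the query above also groups by `col_2`; in the sample data `col_2` is different on every row, so each of those groups would have a count of 1 and `rank()` would tie all of them. The sketch below counts per `col_1`, `col_3`, `col_4` only, which is the grouping the question describes. The table name `mytable` is taken from the answer; the alias `cnt` is introduced here for illustration.

```sql
-- Count rows per (col_1, col_3, col_4) combination, largest groups first.
select col_1, col_3, col_4, count(*) as cnt
from mytable
group by col_1, col_3, col_4
order by col_1, cnt desc, col_3, col_4;

-- With the sample data from the question this yields:
-- a1  c1  d1  3
-- a1  c1  d2  2
-- a1  c2  d2  2
-- a1  c2  d3  2
-- a1  c1  d3  1
-- a1  c3  d3  1
-- a2  c1  d1  1
-- a3  c1  d1  1
```

The first line for each `col_1` is the combination the question wants, which is why `a1` maps to `c1`, `d1`.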

Use the `row_number()` window function:

select t.col_1, t.col_2, t.col_3, t.col_4
from (
  select col_1, min(col_2) col_2, col_3, col_4,
    row_number() over (partition by col_1 order by count(*) desc) rn
  from tablename
  group by col_1, col_3, col_4
) t
where t.rn = 1
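`row_number()` returns exactly one row per `col_1`, but if two groups tie for the highest count, which of them gets `rn = 1` is not defined by the ordering above. A minimal sketch of one way to make the choice deterministic, assuming that breaking ties by `col_3` and then `col_4` is acceptable (that rule is an assumption, not part of the answer). The count is computed in an inner subquery so the window function sorts on a plain column; the aliases `g`, `r`, and `cnt` are introduced here for illustration.

```sql
select r.col_1, r.col_2, r.col_3, r.col_4
from (
  select g.*,
         -- highest count first; ties broken by col_3, then col_4 (assumed rule)
         row_number() over (partition by g.col_1
                            order by g.cnt desc, g.col_3, g.col_4) as rn
  from (
    -- one row per (col_1, col_3, col_4) group, keeping the smallest col_2
    select col_1, min(col_2) as col_2, col_3, col_4, count(*) as cnt
    from tablename
    group by col_1, col_3, col_4
  ) g
) r
where r.rn = 1
order by r.col_1;

-- With the sample data from the question this yields:
-- a1  b1   c1  d1
-- a2  b12  c1  d1
-- a3  b13  c1  d1
```

Since `col_2` is a string, `min(col_2)` compares lexicographically, so `'b10'` sorts before `'b2'`.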


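If you need the complete rows, you can use window functions:

select t.*
from (select t.*,
             -- rank rows within each col_1, highest count first
             rank() over (partition by col_1 order by cnt desc) as seqnum
      from (select t.*,
                   -- number of rows sharing this col_1/col_3/col_4 combination
                   count(*) over (partition by col_1, col_3, col_4) as cnt
            from t
           ) t
     ) t
where seqnum = 1;

The innermost subquery counts the rows for each `col_1`/`col_3`/`col_4` combination. The middle subquery ranks the rows within each `col_1` by that count, highest first. The outermost query keeps only the rows with the highest count.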
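Because `rank()` gives the same rank to every row with the same count, this approach keeps all rows of the winning group; with the question's sample data that would be `b1`, `b2`, and `b3` for `a1`. If exactly one row per `col_1` is wanted, as in the expected output, one option is to switch to `row_number()` with a tie-breaker. The sketch below writes the same three steps as common table expressions, assuming the question's column names (`col_1`, `col_3`, `col_4`), a table called `mytable`, and that picking the smallest `col_2` is an acceptable tie-break. The CTE names `counted` and `ranked` are hypothetical.

```sql
with counted as (
  -- step 1: attach the size of its (col_1, col_3, col_4) group to every row
  select m.*,
         count(*) over (partition by col_1, col_3, col_4) as cnt
  from mytable m
),
ranked as (
  -- step 2: number the rows within each col_1, largest group first,
  -- smallest col_2 first within a group (assumed tie-break)
  select c.*,
         row_number() over (partition by col_1
                            order by cnt desc, col_2) as rn
  from counted c
)
-- step 3: keep a single row per col_1
select col_1, col_2, col_3, col_4
from ranked
where rn = 1
order by col_1;

-- With the sample data from the question this yields:
-- a1  b1   c1  d1
-- a2  b12  c1  d1
-- a3  b13  c1  d1
```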

You have tagged your question with several database systems. Please retag it with the one you are actually using.