SQL: Select the highest counts for a categorical variable when grouping (tags: sql, hiveql)

I have the following table:

custID  Cat
   1    A
   1    B
   1    B
   1    B
   1    C
   2    A
   2    A
   2    C
   3    B
   3    C
   4    A
   4    C
   4    C
   4    C

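What I need is the most efficient way to aggregate by custID so as to get the most frequent category (Cat), the second most frequent category, and the third. The output for the table above should be: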

custID  most freq   2nd most freq   3rd most freq
   1        B             A              C
   2        A             C             Null
   3        B             C             Null
   4        C             A             Null

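When counts are tied, I don't really care which category comes first and which comes second. For example, for customer 1 the second and third most frequent categories can be swapped, because each of them occurs only once.

Any SQL dialect will do, preferably Hive SQL.

Thanks.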

Try using group by twice, and use dense_rank() to order by the count of each cat. To be honest, I am not 100% sure, but I think it should work in Hive as well.

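-- Outer query: pivot ranks 1 to 3 into one column each.
-- Note: the [ ] alias quoting is not supported in Hive.
select custId,
    max(case when t.rn = 1 then cat end) as [most freq],
    max(case when t.rn = 2 then cat end) as [2nd most freq],
    max(case when t.rn = 3 then cat end) as [3rd most freq]
from
(
  -- Inner query: count rows per (custId, cat) and rank the categories by that count.
  select custId, cat, dense_rank() over (partition by custId order by count(*) desc) rn
  from your_table 
  group by custId, cat
) t
group by custId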

Following the comments, I have added a slightly modified solution that is compatible with Hive SQL:

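-- Outer query: pivot ranks 1 to 3 into one column each.
select custId,
    max(case when t.rn = 1 then cat else null end) as most_freq,
    max(case when t.rn = 2 then cat else null end) as 2nd_most_freq,
    max(case when t.rn = 3 then cat else null end) as 3rd_most_freq
from
(
  -- Rank the categories of each customer by their precomputed count.
  select custId, cat, dense_rank() over (partition by custId order by ct desc) rn
  from (
    -- Count first, in its own subquery, so the window function only sees plain columns.
    select custId, cat, count(*) ct
    from your_table 
    group by custId, cat
  ) your_table_with_counts
) t
group by custId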
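
As a quick way to try the nested-subquery form on the sample data from the question, the sketch below runs it through Python's built-in sqlite3 module. This is only an approximation under stated assumptions: SQLite stands in for Hive (window functions need SQLite 3.25 or newer), the function name top_categories is made up for this example, and the aliases second_most_freq and third_most_freq are renamed because SQLite does not accept unquoted identifiers that start with a digit.

```python
import sqlite3

# Sample rows (custID, Cat) copied from the question.
ROWS = [
    (1, "A"), (1, "B"), (1, "B"), (1, "B"), (1, "C"),
    (2, "A"), (2, "A"), (2, "C"),
    (3, "B"), (3, "C"),
    (4, "A"), (4, "C"), (4, "C"), (4, "C"),
]

# Same shape as the Hive-compatible query: count, then rank, then pivot.
# The aliases are illustrative; "order by custId" is added only to make
# the row order of the result predictable.
QUERY = """
select custId,
    max(case when t.rn = 1 then cat end) as most_freq,
    max(case when t.rn = 2 then cat end) as second_most_freq,
    max(case when t.rn = 3 then cat end) as third_most_freq
from (
  select custId, cat,
         dense_rank() over (partition by custId order by ct desc) as rn
  from (
    select custId, cat, count(*) as ct
    from your_table
    group by custId, cat
  ) your_table_with_counts
) t
group by custId
order by custId
"""


def top_categories():
    """Load the sample rows into an in-memory database and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table your_table (custId integer, cat text)")
    conn.executemany("insert into your_table values (?, ?)", ROWS)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    for row in top_categories():
        print(row)
```

Customers 2 and 4, which have no tied counts, come back as A, C, NULL and C, A, NULL, matching the expected output in the question. Customers with tied counts (1 and 3) depend on how the ranking function treats ties.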

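Use dense_rank instead of row_number, so that if there are ties, the tied values do not show up as the second and third most frequent values. Also remove the [] around the column aliases, since they are not supported in Hive.

I ran it but got the following error: Error while compiling statement: FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies. Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 7:84 Not yet supported place for UDAF 'count'.

It may be a Hive SQL issue. The problem is purely technical. I think you have both solved it from a data-logic standpoint, and I am very grateful to you both for that.

@criticalth OK, and when you try this

@RadimBača I don't understand which table your_table_with_counts is supposed to be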
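
To make the difference between the two ranking functions concrete, here is a small sketch that computes both ranks side by side on the sample table from the question, again using Python's sqlite3 module as a stand-in for Hive (SQLite 3.25 or newer). The function name rank_comparison, the column aliases, and the extra cat tie-breaker in the row_number ordering are assumptions added for this example; the tie-breaker only makes the output deterministic.

```python
import sqlite3

# Sample rows (custID, Cat) copied from the question.
ROWS = [
    (1, "A"), (1, "B"), (1, "B"), (1, "B"), (1, "C"),
    (2, "A"), (2, "A"), (2, "C"),
    (3, "B"), (3, "C"),
    (4, "A"), (4, "C"), (4, "C"), (4, "C"),
]

# dense_rank gives tied counts the same rank; row_number always gives
# distinct, consecutive numbers within each customer.
COMPARE_QUERY = """
select custId, cat, ct,
       dense_rank() over (partition by custId order by ct desc) as dense_rnk,
       row_number() over (partition by custId order by ct desc, cat) as row_num
from (
  select custId, cat, count(*) as ct
  from your_table
  group by custId, cat
) c
order by custId, row_num
"""


def rank_comparison():
    """Return (custId, cat, ct, dense_rnk, row_num) for every customer/category pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table your_table (custId integer, cat text)")
    conn.executemany("insert into your_table values (?, ?)", ROWS)
    rows = conn.execute(COMPARE_QUERY).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in rank_comparison():
        print(row)
```

For customer 1, categories A and C each occur once, so dense_rank assigns both of them rank 2 while row_number assigns 2 and 3. When the ranks are pivoted with max(case when rn = ... then cat end), dense_rank therefore keeps only one of the tied categories and leaves the next slot NULL, whereas row_number fills both the second and third slots.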
