BigQuery: group by the top n categories and bucket the rest as "other"

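I often perform the same task: aggregating data by the top X values of a categorical variable and rolling all remaining values into "other".

So far I have been using this trick: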

SELECT
year,
if(tt.state is null, "other", t.state) as state_filtered,
count(1) as children
FROM [publicdata:samples.natality] as t
LEFT OUTER JOIN (
  SELECT state, count(1) as children FROM [publicdata:samples.natality]
  WHERE state is not null
  GROUP BY state
  ORDER BY children DESC
  LIMIT 5
) as tt ON tt.state=t.state
GROUP BY year, state_filtered
ORDER BY year, state_filtered
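But it is not very clean: I query the same table twice, and in real-world cases the code becomes too complicated. I looked for a solution using ROLLUP or TOP, but did not find anything better.

Does anyone know a better way?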

You can use ROW_NUMBER() in a subquery:

SELECT
  IF (RNB<=5, state, "Other") AS state,
  SUM(children) AS Children
FROM (
  SELECT
    state,
    children,
    ROW_NUMBER() OVER (ORDER BY children DESC) AS RNB
  FROM (
    SELECT
      state,
      COUNT(1) AS children,
    FROM
      [publicdata:samples.natality]
    WHERE
      state IS NOT NULL
    GROUP BY
      state))
GROUP EACH BY
  state

I think a single subselect is enough:

SELECT 
  year,
  IF (pos <= 5, state, "other") AS state,
  SUM(children) AS children
FROM (
  SELECT
    year,
    state,
    ROW_NUMBER() OVER (PARTITION BY year ORDER BY children DESC) AS pos,
    COUNT(1) AS children,
  FROM
    [publicdata:samples.natality]
  WHERE
    state IS NOT NULL
  GROUP BY
    year, state
)
GROUP BY year, state
ORDER BY year, state

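I think there is a shortcut that gives you the top five states overall. There is no join, so, at least as far as the code goes, it does only one scan. It ran about twice as fast as the original code you are currently using. I am not sure you will like it; that depends on your real use case.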

SELECT
  year, 
  state, 
  SUM(children) as children
FROM (
  SELECT
    state,
    REGEXP_EXTRACT(year_info, r'^(\w+)') as year,
    INTEGER(REGEXP_EXTRACT(year_info, r'(\w+)$')) as children,
  FROM (
    SELECT
      CASE WHEN pos < 6 THEN state ELSE 'other' END state,
      SPLIT(years_list) as year_info
    FROM (
      SELECT 
        state,
        GROUP_CONCAT(STRING(year) + '|' + STRING(rows)) as years_list,
        ROW_NUMBER() OVER(ORDER BY children DESC) as pos,
        SUM(rows) as children
      FROM (
        SELECT year, state, COUNT(1) AS rows
        FROM [publicdata:samples.natality]
        WHERE state IS NOT NULL
        GROUP BY year, state
      )    
      GROUP BY state
    )
  )
)
GROUP BY year, state
ORDER BY year, state

I suspect there is a better way to do the GROUP_CONCAT/SPLIT trick.

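If you want to compute the top 5 states overall, there is no way to avoid two scans. If you want a different top 5 computed for each year, however, there may be a solution.

This is a nice solution, but there is one problem: every year gets a different top 5 states. In my case I always pick a single set of states, so it looks like there is no shortcut for me.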
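The difference between ranking states once overall and ranking them separately within each year can be checked on toy data. The following is only a minimal sketch: it runs against an in-memory SQLite database through Python's sqlite3 module (not BigQuery), and the table name `births`, the sample rows, and the helper names `make_toy_db`, `top_n_global` and `top_n_per_year` are all hypothetical. It assumes a SQLite build with window-function support (3.25 or later). It says nothing about how many scans BigQuery would perform.

```python
import sqlite3

# Hypothetical toy data: one row per birth, as (year, state).
TOY_BIRTHS = (
    [(2000, "CA")] * 5 + [(2000, "WA")] * 3 + [(2000, "NY")] * 2 + [(2000, "TX")] * 1
    + [(2001, "TX")] * 4 + [(2001, "NY")] * 2 + [(2001, "CA")] * 1
)

# Shared first step: aggregate once to (year, state, children).
COUNTS_CTE = """
WITH counts AS (
  SELECT year, state, COUNT(*) AS children
  FROM births
  WHERE state IS NOT NULL
  GROUP BY year, state
)"""

# Shared last step: keep states ranked within the top n, bucket the rest.
ROLLUP_SELECT = """
SELECT year,
       CASE WHEN pos <= :n THEN state ELSE 'other' END AS state_filtered,
       SUM(children) AS children
FROM ranked
GROUP BY year, state_filtered
ORDER BY year, state_filtered
"""


def make_toy_db():
    """Create an in-memory database holding the toy births table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE births (year INTEGER, state TEXT)")
    conn.executemany("INSERT INTO births VALUES (?, ?)", TOY_BIRTHS)
    return conn


def top_n_global(conn, n):
    """Rank states by their total over all years: one fixed set of top states."""
    sql = COUNTS_CTE + """,
totals AS (
  SELECT year, state, children,
         SUM(children) OVER (PARTITION BY state) AS state_total
  FROM counts
),
ranked AS (
  -- DENSE_RANK gives every row of a state the same position;
  -- states with tied totals share a rank, so more than n may be kept.
  SELECT year, state, children,
         DENSE_RANK() OVER (ORDER BY state_total DESC) AS pos
  FROM totals
)""" + ROLLUP_SELECT
    return conn.execute(sql, {"n": n}).fetchall()


def top_n_per_year(conn, n):
    """Rank states separately inside each year: the top set can change yearly."""
    sql = COUNTS_CTE + """,
ranked AS (
  SELECT year, state, children,
         ROW_NUMBER() OVER (PARTITION BY year ORDER BY children DESC) AS pos
  FROM counts
)""" + ROLLUP_SELECT
    return conn.execute(sql, {"n": n}).fetchall()


if __name__ == "__main__":
    conn = make_toy_db()
    print("global top 2:  ", top_n_global(conn, 2))
    print("per-year top 2:", top_n_per_year(conn, 2))
```

With the toy rows, the overall ranking keeps CA and TX in both years, while the per-year ranking keeps CA and WA for 2000 but TX and NY for 2001, which is the behaviour described in the comment above. Both variants name the table only once in the query text.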