BigQuery: group by the top N categories and put the rest into "other"
I often perform the same task: aggregate data by the top X values of a categorical variable and roll all the remaining values up into "other". So far I have been using this trick:
SELECT
  year,
  if(tt.state is null, "other", t.state) as state_filtered,
  count(1) as children
FROM [publicdata:samples.natality] as t
LEFT OUTER JOIN (
  SELECT state, count(1) as children FROM [publicdata:samples.natality]
  WHERE state is not null
  GROUP BY state
  ORDER BY children DESC
  LIMIT 5
) as tt ON tt.state=t.state
GROUP BY year, state_filtered
ORDER BY year, state_filtered
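But it is not very clean: I query the same table twice, and in real-world cases the code gets too complicated. I looked for a solution using ROLLUP or TOP but did not find anything better.
Does anyone know a better way?
You can use ROW_NUMBER() in a subquery: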
SELECT
  IF (RNB<=5, state, "Other") AS state,
  SUM(children) AS Children
FROM (
  SELECT
    state,
    children,
    ROW_NUMBER() OVER (ORDER BY children DESC) AS RNB
  FROM (
    SELECT
      state,
      COUNT(1) AS children,
    FROM
      [publicdata:samples.natality]
    WHERE
      state IS NOT NULL
    GROUP BY
      state))
GROUP EACH BY
  state
I think a single subselect is enough:
SELECT
  year,
  IF (pos <= 5, state, "other") AS state,
  SUM(children) AS children
FROM (
  SELECT
    year,
    state,
    ROW_NUMBER() OVER (PARTITION BY year ORDER BY children DESC) AS pos,
    COUNT(1) AS children,
  FROM
    [publicdata:samples.natality]
  WHERE
    state IS NOT NULL
  GROUP BY
    year, state
)
GROUP BY year, state
ORDER BY year, state
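Note that with PARTITION BY year the ranking restarts every year, so the set of states kept outside "other" can change from one year to the next. Here is a minimal sketch of that effect, run from Python against an in-memory SQLite database rather than BigQuery (window functions need SQLite 3.25 or newer). The `births` table, its rows, and the top-1 cutoff are made up for illustration; they are not natality figures.

```python
import sqlite3

# Hypothetical toy data, already aggregated: (year, state, children).
rows = [
    (2000, "CA", 50), (2000, "TX", 30), (2000, "NY", 20),
    (2001, "CA", 10), (2001, "TX", 40), (2001, "NY", 20),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE births (year INTEGER, state TEXT, children INTEGER)")
conn.executemany("INSERT INTO births VALUES (?, ?, ?)", rows)

# Rank states inside each year, keep the top 1 per year,
# and roll every other state into 'other'.
per_year_sql = """
SELECT year,
       CASE WHEN pos <= 1 THEN state ELSE 'other' END AS state_filtered,
       SUM(children) AS children
FROM (
  SELECT year, state, children,
         ROW_NUMBER() OVER (PARTITION BY year ORDER BY children DESC) AS pos
  FROM births
)
GROUP BY year, state_filtered
ORDER BY year, state_filtered
"""
per_year = conn.execute(per_year_sql).fetchall()
for row in per_year:
    print(row)
```

In this toy data CA leads in 2000 and TX leads in 2001, so each year keeps a different state. That is fine if you want a per-year top N, but it is not the same as one fixed global set.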
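I think there is a shortcut that gives you the global top 5 states. There is no join, so, at least as far as the code goes, it does only one scan! Compared with the original code you are using now, it is twice as fast. Not sure whether you will like it; that depends on your real case.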
SELECT
  year,
  state,
  SUM(children) as children
FROM (
  SELECT
    state,
    REGEXP_EXTRACT(year_info, r'^(\w+)') as year,
    INTEGER(REGEXP_EXTRACT(year_info, r'(\w+)$')) as children,
  FROM (
    SELECT
      CASE WHEN pos < 6 THEN state ELSE 'other' END state,
      SPLIT(years_list) as year_info
    FROM (
      SELECT
        state,
        GROUP_CONCAT(STRING(year) + '|' + STRING(rows)) as years_list,
        ROW_NUMBER() OVER(ORDER BY children DESC) as pos,
        SUM(rows) as children
      FROM (
        SELECT year, state, COUNT(1) AS rows
        FROM [publicdata:samples.natality]
        WHERE state IS NOT NULL
        GROUP BY year, state
      )
      GROUP BY state
    )
  )
)
GROUP BY year, state
ORDER BY year, state
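I feel there should be a better way to handle the GROUP_CONCAT/SPLIT trick.
If you want the global top 5 states, there is no way to avoid two scans. But if you want a different top 5 for each year, there may be a solution.
This is a nice solution, but there is one problem: every year gets a different top 5 states. In my case I always pick a single set. It looks like there is no shortcut for me.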
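For the "one fixed global set" case, one pattern that at least keeps the source table out of the query text a second time is to compute each state's overall total with a window SUM over the per-(year, state) aggregate and then rank on that total. The sketch below runs portable SQL through SQLite from Python (SQLite 3.25 or newer for window functions); it is not BigQuery legacy SQL, and the `births` table, its rows, and the top-1 cutoff are hypothetical. It does not show how many scans BigQuery would actually perform; that is up to the query planner.

```python
import sqlite3

# Hypothetical toy data standing in for the per-(year, state) counts.
# In BigQuery, `births` would be the GROUP BY year, state subquery.
rows = [
    (2000, "CA", 50), (2000, "TX", 30), (2000, "NY", 20),
    (2001, "CA", 10), (2001, "TX", 40), (2001, "NY", 20),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE births (year INTEGER, state TEXT, children INTEGER)")
conn.executemany("INSERT INTO births VALUES (?, ?, ?)", rows)

global_sql = """
WITH totals AS (
  -- Attach each state's all-years total to every (year, state) row.
  SELECT year, state, children,
         SUM(children) OVER (PARTITION BY state) AS state_total
  FROM births
),
ranked AS (
  -- Rank states by that total; all rows of one state share a rank.
  SELECT year, state, children,
         DENSE_RANK() OVER (ORDER BY state_total DESC) AS pos
  FROM totals
)
SELECT year,
       CASE WHEN pos <= 1 THEN state ELSE 'other' END AS state_filtered,
       SUM(children) AS children
FROM ranked
GROUP BY year, state_filtered
ORDER BY year, state_filtered
"""
global_top = conn.execute(global_sql).fetchall()
for row in global_top:
    print(row)
```

Here TX has the largest overall total (70), so it is the state kept in both years, even though CA leads in 2000. One caveat of using DENSE_RANK: states whose totals tie share a rank, so a tie at the cutoff keeps more than N states.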