SQL: BigQuery percentile function cannot be executed in the allotted memory on more than 30 million rows
I want to get the percentile distribution of a column of data. My query looks like this:
#StandardSQL
SELECT
PERCENTILE_CONT(age, 0) OVER() AS min,
PERCENTILE_CONT(age, 0.05) OVER() AS percentile5,
PERCENTILE_CONT(age, 0.25) OVER() AS percentile25,
PERCENTILE_CONT(age, 0.50) OVER() AS percentile50,
PERCENTILE_CONT(age, 0.75) OVER() AS percentile75,
PERCENTILE_CONT(age, 0.95) OVER() AS percentile95,
PERCENTILE_CONT(age, 1) OVER() AS max
FROM `data`
However, I keep running into this error:
The query could not be executed in the allotted memory.
OVER() operator used too much memory..
I also tried computing one percentile at a time, like this:
select PERCENTILE_CONT(age, 0.05) OVER() AS percentile5
from data
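But this produces the same error.
My table has 30 million rows. Is there any way to optimize this?
Thanks.

I would sort your data and then compute the percentile rank manually. If you need interpolation, that can also be done manually.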
WITH ORDERED AS (
SELECT
*,
ROW_NUMBER() OVER(ORDER BY age ASC) AS ROWNUM
FROM
`data`
)
SELECT
age AS percentile50
FROM
ORDERED
WHERE
ROWNUM = (
SELECT CEILING(50 / 100.00 * (COUNT(*) + 1)) FROM ORDERED
)
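Presumably, age does not have many distinct values. If so, you can aggregate the data first and then compute whatever you need on the much smaller result.
For example: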
-- running_cnt - cnt is the number of rows with a strictly smaller age,
-- so the largest age that starts before p * total_cnt is the p-th percentile.
select min(age) as min,
       max(case when running_cnt - cnt < 0.05 * total_cnt
                then age
           end) as percentile_05,
       max(case when running_cnt - cnt < 0.5 * total_cnt
                then age
           end) as percentile_50,
       max(age) as max
from (select age, count(*) as cnt,
             sum(count(*)) over (order by age) as running_cnt,
             sum(count(*)) over () as total_cnt
      from `data`
      group by age
     ) d
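Can you use APPROX_QUANTILES instead? The problem is that PERCENTILE_CONT does not scale when the window is the entire table.

@ElliottBrossard That is good enough for the 25th, 50th, and 75th percentiles, but what about the 5th and 95th? I could compute about 20 quantiles and take the first and last buckets, but that is inelegant. Is that the only option? Thanks.

What is wrong with getting the 5th and 95th percentiles? For example, you can use APPROX_QUANTILES(age, 100)[OFFSET(5)].

At least in my case, APPROX_QUANTILES is too approximate. It varies by about 2%, which matters when trying to detect changes in the quantiles.

If the original data hits a resource error, I would guess this one will too. Ordered analytic functions without partitioning can be a problem in BigQuery.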
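
To make the APPROX_QUANTILES suggestion from the comments concrete, here is a minimal sketch that returns all seven values from the question in a single aggregation, with no OVER() window. It assumes the same `data` table and `age` column as the question; the alias `q` is only illustrative. With 100 buckets the function returns an array of 101 elements, so offset 0 is the approximate minimum and offset 100 the approximate maximum.

```sql
#StandardSQL
-- Sketch only: results are approximate, not exact percentiles.
SELECT
  q[OFFSET(0)]   AS min,
  q[OFFSET(5)]   AS percentile5,
  q[OFFSET(25)]  AS percentile25,
  q[OFFSET(50)]  AS percentile50,
  q[OFFSET(75)]  AS percentile75,
  q[OFFSET(95)]  AS percentile95,
  q[OFFSET(100)] AS max
FROM (
  -- One aggregate over the whole table; 100 buckets gives 101 boundaries.
  SELECT APPROX_QUANTILES(age, 100) AS q
  FROM `data`
)
```

Because this is an aggregate function and not an analytic one, it avoids materializing the whole table in a single window. The trade-off is the approximation error that one commenter above found too large for their case; if exact values are required, the aggregate-by-age approach is the safer choice.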