SQL: stratified sampling by range
I have table_1, which contains data like this:

RANGE_START  RANGE_END  FREQUENCY
         10         20         90
         20         30         68
         30         40        314
         40         40        191

The last row, 40 40 191, means that the single data point 40 is repeated 191 times.

Table 2:
group value
10 56.1
10 88.3
20 53
20 20
30 55
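I need a stratified sample of table 2 based on the ranges in table 1. Table 2 can have millions of rows, but the result should be limited to only 10k points.
I tried the following query: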
SELECT
    d.*
FROM
    (
        SELECT
            ROW_NUMBER() OVER (
                PARTITION BY group
                ORDER BY group
            ) AS seqnum,
            COUNT(*) OVER () AS ct,
            COUNT(*) OVER (PARTITION BY group) AS cpt,
            group, value
        FROM
            table_2 d
    ) d
WHERE
    seqnum < 10000 * ( cpt * 1.0 / ct )
If that means you need at least one record from each group plus a random number of further records, then try the following:
SELECT GROUP, VALUE FROM
    (SELECT T2.GROUP, T2.VALUE,
            ROW_NUMBER()
              OVER (PARTITION BY T2.GROUP ORDER BY NULL) AS RN
     FROM TABLE_1 T1
     JOIN TABLE_2 T2
       ON (T1.RANGE = T2.GROUP))
WHERE RN = 1 OR
      CASE WHEN RN > 1
            AND RN = CEIL(DBMS_RANDOM.VALUE(1,RN))
           THEN 1 END = 1
FETCH FIRST 10000 ROWS ONLY;
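Here the row number within each group is assigned in no particular order. The result keeps row number 1 of every group, plus any other row numbers that happen to satisfy the random condition.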
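As a rough check of what that random condition does, here is a minimal Python sketch. It is not part of the answer, and it assumes that DBMS_RANDOM.VALUE(1, RN) is uniformly distributed on [1, RN), so that CEIL of it lands on each of 2..RN with equal probability. The function names `keep_probability` and `expected_rows_kept` are made up for this illustration.

```python
from fractions import Fraction


def keep_probability(rn):
    """Chance that the row with row number rn passes the WHERE clause.

    Assumption: DBMS_RANDOM.VALUE(1, rn) is uniform on [1, rn), so
    CEIL(...) equals each of 2..rn with probability 1 / (rn - 1).
    """
    if rn == 1:
        return Fraction(1)        # RN = 1 is always kept
    return Fraction(1, rn - 1)    # RN > 1 is kept only when CEIL(...) == rn


def expected_rows_kept(group_size):
    """Expected number of rows kept from one group of the given size."""
    return sum(keep_probability(rn) for rn in range(1, group_size + 1))


for size in (1, 2, 3, 10, 1000):
    print(size, float(expected_rows_kept(size)))
```

Under that assumption the expected number of rows kept per group is 1 plus a harmonic sum, which grows only logarithmically with group size, so this condition on its own would not be expected to give counts in proportion to the frequencies in table 1.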
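Cheers!!

If I understand what you want, which is by no means certain, then I think you want at most 10000 rows, with the number of rows per range in proportion to the frequencies. So you can get the number of rows required from each range with: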
select range_start, range_end, frequency,
       frequency/sum(frequency) over () as proportion,
       floor(10000 * frequency/sum(frequency) over ()) as limit
from table_1;
RANGE_START  RANGE_END  FREQUENCY PROPORTION      LIMIT
----------- ---------- ---------- ---------- ----------
         10         20         90 .135746606       1357
         20         30         68 .102564103       1025
         30         40        314 .473604827       4736
         40         40        191 .288084465       2880
Those limits add up to slightly less than 10000 (here 9998); you could go slightly over instead by using ceil rather than floor.
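The rounding arithmetic can be checked outside the database. The following is a minimal Python sketch, not part of the original answer. The function name `allocate` and the largest-remainder step are assumptions introduced here as one possible way to make the limits add up to exactly the target.

```python
from math import ceil, floor

# Frequencies from table_1, keyed by (range_start, range_end).
FREQUENCIES = {(10, 20): 90, (20, 30): 68, (30, 40): 314, (40, 40): 191}


def allocate(frequencies, target=10000):
    """Split `target` rows across ranges in proportion to frequency.

    Starts from the floor-based limits used in the query, then hands the
    leftover rows to the ranges with the largest fractional remainders
    (largest-remainder rounding; an assumption, not in the original answer).
    """
    total = sum(frequencies.values())
    # Integer arithmetic avoids floating-point rounding surprises.
    limits = {rng: target * freq // total for rng, freq in frequencies.items()}
    leftover = target - sum(limits.values())
    by_remainder = sorted(
        frequencies,
        key=lambda rng: (target * frequencies[rng]) % total,
        reverse=True,
    )
    for rng in by_remainder[:leftover]:
        limits[rng] += 1
    return limits


total = sum(FREQUENCIES.values())
floor_limits = {r: floor(10000 * f / total) for r, f in FREQUENCIES.items()}
ceil_limits = {r: ceil(10000 * f / total) for r, f in FREQUENCIES.items()}
exact_limits = allocate(FREQUENCIES)

print(sum(floor_limits.values()))  # 9998
print(sum(ceil_limits.values()))   # 10002
print(sum(exact_limits.values()))  # 10000
```

With the sample frequencies, floor undershoots by two rows and ceil overshoots by two; the largest-remainder variant gives the two spare rows to the 20-30 range and the single-point 40 range.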
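You can then assign a nominal row number to each entry in table 2 based on which range it falls in, and restrict the number of rows taken from each range to that limit: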
with cte1 (range_start, range_end, limit) as (
  select range_start, range_end, floor(10000 * frequency/sum(frequency) over ())
  from table_1
),
cte2 (grp, value, limit, rn) as (
  select t2.grp, t2.value, cte1.limit,
    row_number() over (partition by cte1.range_start order by t2.value) as rn
  from cte1
  join table_2 t2
  on (cte1.range_end > cte1.range_start and t2.grp >= cte1.range_start and t2.grp < cte1.range_end)
  or (cte1.range_end = cte1.range_start and t2.grp = cte1.range_start)
)
select grp, value
from cte2
where rn <= limit;
...
9998 rows selected.
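I've used order by t2.value in the row_number() call because it isn't clear how you want to pick which rows within a range are actually kept; you might want to order by dbms_random.value, or by something else.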
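The same idea, with a random pick inside each range instead of ordering by value, can be sketched in Python. This is only an illustration under stated assumptions: the helper names `find_range` and `stratified_sample`, the tiny data set, and the small limits are all hypothetical, and `random.sample` stands in for ordering by a random value and keeping the first rows.

```python
import random
from collections import defaultdict


def find_range(grp, ranges):
    """Return the (start, end) range containing grp, or None.

    Mirrors the join condition: start <= grp < end for ordinary ranges,
    grp == start for a single-point range where start == end.
    """
    for start, end in ranges:
        if (end > start and start <= grp < end) or (end == start and grp == start):
            return (start, end)
    return None


def stratified_sample(rows, limits, seed=None):
    """Pick at most limits[range] rows at random from each range.

    rows is an iterable of (grp, value); limits maps (start, end) -> row cap.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for grp, value in rows:
        key = find_range(grp, limits)
        if key is not None:
            strata[key].append((grp, value))
    sample = []
    for key, members in strata.items():
        # Never ask for more rows than the range actually holds.
        sample.extend(rng.sample(members, min(limits[key], len(members))))
    return sample


# Hypothetical small data set and limits, only for illustration.
limits = {(10, 20): 2, (20, 30): 1, (30, 40): 3, (40, 40): 1}
rows = [(10, 56.1), (10, 88.3), (15, 1.0), (20, 53), (20, 20),
        (30, 55), (40, 7), (40, 8)]
sample = stratified_sample(rows, limits, seed=42)
print(len(sample))  # 5
```

Note that a range with fewer rows than its limit simply contributes everything it has, so the total can fall short of the target, just as it can in the SQL version.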
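The SQL above was run with some artificial data.

Why does the result table have only one record for group 10?
The result table is only an example. Basically, the result should contain a stratified sample of the records in table 2, i.e. the 10000 results should be in proportion to the frequencies, so there should be roughly 5 times as many values in the 30-40 range as in the 20-30 range, i.e. 314/68 = 4.617...
How many times? Also, are rows for group 20 counted in the 10-20 range, or the 20-30 range, or both? One might assume you include values >= start and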