SQL: stratified sampling by range

I have table_1, which contains data like this:

range_start  range_end  frequency
10           20         90
20           30         68
30           40         314
40           40         191

Here the row 40 40 191 means that the single data point 40 occurs 191 times.

table_2:

group     value   
10        56.1   
10        88.3   
20        53   
20        20   
30        55   
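I need to get a stratified sample based on the ranges in table_1. table_2 can have millions of rows, but the result should be limited to only 10k points.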

I tried the following query:

SELECT   
    d.*   
FROM   
    (   
        SELECT   
            ROW_NUMBER() OVER(   
                                PARTITION BY group   
                                ORDER BY group   
                            ) AS seqnum,   
            COUNT(*) OVER() AS ct,   
            COUNT(*) OVER(PARTITION BY group) AS cpt,   
            group, value   
        FROM   
            table_2 d   
    ) d   
WHERE   
    seqnum < 10000 * ( cpt * 1.0 / ct )   

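That means you need at least one record from each group plus a random number of additional records. In that case, try the following: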

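-- GROUP is a reserved word in Oracle, so the column is referred to as GRP here.
-- Assumption: groups are matched on the lower bound of each range (RANGE_START).
SELECT GRP, VALUE FROM
(SELECT T2.GRP, T2.VALUE,
ROW_NUMBER()
OVER (PARTITION BY T2.GRP ORDER BY NULL) AS RN
FROM TABLE_1 T1
JOIN TABLE_2 T2
ON (T1.RANGE_START = T2.GRP))
WHERE RN = 1 OR
CASE WHEN RN > 1
AND RN = CEIL(DBMS_RANDOM.VALUE(1,RN))
THEN 1 END = 1
FETCH FIRST 10000 ROWS ONLY;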
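Here the row number within each group is assigned in arbitrary order. The result then contains row number 1 of every group, plus any other row numbers that happen to satisfy the random condition.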


Cheers

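If I understand what you want (which is by no means certain), then I think you want to get at most 10000 rows, with the number of rows per group proportional to the frequencies. You can get the number of rows required from each range with: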

select range_start, range_end, frequency,
  frequency/sum(frequency) over () as proportion,
  floor(10000 * frequency/sum(frequency) over ()) as limit
from table_1;

RANGE_START  RANGE_END  FREQUENCY PROPORTION      LIMIT
----------- ---------- ---------- ---------- ----------
         10         20         90 .135746606       1357
         20         30         68 .102564103       1025
         30         40        314 .473604827       4736
         40         40        191 .288084465       2880
These limits add up to slightly less than 10000; you could go slightly higher by using ceil instead of floor.
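If the limits have to sum to exactly 10000, one common option is a largest-remainder allocation: floor every share, then hand the missing rows to the ranges whose fractional parts were largest. The following is only a sketch under that assumption. The names raw_share, base_limit, rem_rank, leftover and row_limit are invented for this example and do not appear in the question.

```sql
-- Sketch: largest-remainder allocation so the per-range limits sum to exactly 10000.
-- Assumes table_1(range_start, range_end, frequency) as in the question.
with shares as (
  select range_start, range_end, frequency,
    10000 * frequency / sum(frequency) over () as raw_share
  from table_1
),
ranked as (
  select range_start, range_end, frequency,
    floor(raw_share) as base_limit,
    -- rank ranges by the fractional part that floor() discarded, largest first
    row_number() over (order by raw_share - floor(raw_share) desc, range_start) as rem_rank,
    -- number of rows still missing after flooring every share
    10000 - sum(floor(raw_share)) over () as leftover
  from shares
)
select range_start, range_end, frequency,
  -- the ranges with the largest remainders each receive one extra row
  base_limit + case when rem_rank <= leftover then 1 else 0 end as row_limit
from ranked
order by range_start;
```

With the four frequencies shown above (90, 68, 314, 191), the floored limits sum to 9998, so the two ranges with the largest remainders (40-40 and 20-30) each gain one row, giving 1357, 1026, 4736 and 2881. The sample only reaches that total if table_2 actually holds at least that many rows in every range.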

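You can then assign a nominal row number to each entry in table_2 based on the range it falls into, and use that limit to restrict the number of rows taken from each range: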

with cte1 (range_start, range_end, limit) as (
  select range_start, range_end, floor(10000 * frequency/sum(frequency) over ())
  from table_1
),
cte2 (grp, value, limit, rn) as (
  select t2.grp, t2.value, cte1.limit,
    row_number() over (partition by cte1.range_start order by t2.value) as rn
  from cte1
  join table_2 t2
  on (cte1.range_end > cte1.range_start and t2.grp >= cte1.range_start and t2.grp < cte1.range_end)
  or (cte1.range_end = cte1.range_start and t2.grp = cte1.range_start)
)
select grp, value
from cte2
where rn <= limit;

...

9998 rows selected.
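I've used order by t2.value in the row_number() call because it isn't clear how you want to pick the actual rows within each range; you might want to order by dbms_random.value or something else instead.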
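As an illustration of the random-ordering variant mentioned here, the query below is the same as the one above except for the order by inside row_number(). It is a sketch, not a tuned solution: it assumes the table_1 and table_2 (grp, value) layout used in this answer, and it returns a different sample on every run.

```sql
-- Sketch: same query, but rows within each range are numbered in random order,
-- so "rn <= limit" keeps a random subset instead of the smallest values.
with cte1 (range_start, range_end, limit) as (
  select range_start, range_end, floor(10000 * frequency/sum(frequency) over ())
  from table_1
),
cte2 (grp, value, limit, rn) as (
  select t2.grp, t2.value, cte1.limit,
    -- dbms_random.value yields a fresh random number per row
    row_number() over (partition by cte1.range_start order by dbms_random.value) as rn
  from cte1
  join table_2 t2
  -- half-open ranges [range_start, range_end), plus single-point ranges such as 40-40
  on (cte1.range_end > cte1.range_start and t2.grp >= cte1.range_start and t2.grp < cte1.range_end)
  or (cte1.range_end = cte1.range_start and t2.grp = cte1.range_start)
)
select grp, value
from cte2
where rn <= limit;
```

Ordering by a random value means every matching row of table_2 has to be sorted within its range, which may be expensive with millions of rows. If a repeatable sample is needed, calling dbms_random.seed beforehand is one option to consider.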


Tested with some artificial data.

Why is there only one record for group 10 in the result table?

The result table is just an example. Basically, the result should contain a stratified sample of the records in table_2, i.e. the 10000 results should be proportional to the frequencies, so there would be about 5 times as many values in the 30-40 range as in the 20-30 range, i.e. 314/68 = 4.617... times.

Also, are rows for group 20 counted in the 10-20 range, in the 20-30 range, or in both? One might assume you include values >= start and ...