Repeatable pseudo-random sample of rows in PostgreSQL

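I want to draw a random sample from a subset of my data using a random seed, so that the sample is repeatable. At the moment I have a query that works, minus the seed: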

select * from my_table
where version is not null
  and start_datetime::date >= date('2020-03-16')
  and start_datetime::date < date('2020-05-15')
order by random()
limit 10000

Now I would like to set a random seed so that I can reliably get the same results from this query.

Is there a good way to do this?

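By definition, random is not repeatable.

If you want to get the same ordering based on random values again, you have no choice but to store the random values somewhere.

I would suggest creating your own table with an added integer column filled with random integers: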

-- Materialize the filtered rows once, each tagged with a stored random key
CREATE TABLE my_random_ordered_sample AS
SELECT
  (RANDOM()*10000)::INT AS rand_ord  -- integer in 0..10000; ties are possible
, *
FROM my_table
WHERE version IS NOT NULL
  AND start_datetime::DATE >= '2020-03-16'::DATE
  AND start_datetime::DATE <  '2020-05-15'::DATE
;
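
To draw the sample from the stored table, sort by the stored key and take the first 10000 rows. This is a minimal sketch. It assumes my_table has a unique column, hypothetically named id here, to break ties between equal rand_ord values. Without a tiebreaker, rows sharing a rand_ord could come back in a different order between runs.

```sql
-- Assumption: "id" is a hypothetical unique column copied over from my_table.
SELECT *
FROM my_random_ordered_sample
ORDER BY rand_ord, id   -- id breaks ties so the order is fully determined
LIMIT 10000;
```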

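One option is to use setseed(). As the documentation explains: once this function is called, the results of subsequent random() calls in the current session can be repeated by re-issuing setseed() with the same argument.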
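
The behaviour described in the documentation can be checked with a small sketch like the following, run within a single session. The column aliases are purely illustrative.

```sql
-- Seed the session's generator, then draw two values.
SELECT setseed(0.5);
SELECT random() AS first_draw, random() AS second_draw;

-- Re-issuing setseed with the same argument restarts the sequence,
-- so this statement should return the same two values as the one above.
SELECT setseed(0.5);
SELECT random() AS first_draw, random() AS second_draw;
```

Applied to the sampling query, the same idea would mean running SELECT setseed(0.5) and then the ORDER BY random() query in the same session. Note that this only reproduces the sample if the rows reach the sort in the same order each time, which PostgreSQL does not guarantee without an explicit ordering.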

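One technique is to include the call to the function directly in the query using UNION ALL, and then sort in the outer query. This requires listing the columns that the query should return. Assuming you want columns col1, col2 and col3, your query would be: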

select *
from (
    -- first branch: one dummy row whose only purpose is to call setseed()
    select setseed(0.5) colx, null col1, null col2, null col3
    union all
    select null, col1, col2, col3
    from my_table
    where
        version is not null
        and start_datetime::date >= date '2020-03-16'
        and start_datetime::date <  date '2020-05-15'
    offset 1  -- drop the dummy row
) t
order by random()
limit 10000

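The offset 1 removes the row produced by the first subquery. The argument to setseed() (0.5 here) can be any value between -1 and 1. As long as you pass the same value, you get the same ordering.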
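
For comparison, a different technique (not the setseed() approach described above) avoids random() entirely and sorts by a hash of a unique key combined with a seed string. This is a rough sketch, assuming my_table has a unique column, hypothetically named id, and using an arbitrary seed string 'seed-1'. The ordering then depends only on the data and the seed, not on session state or scan order.

```sql
-- Assumptions: "id" is a hypothetical unique column of my_table;
-- 'seed-1' is an arbitrary seed string, change it to get a different sample.
select *
from my_table
where version is not null
  and start_datetime::date >= date '2020-03-16'
  and start_datetime::date <  date '2020-05-15'
order by md5(id::text || 'seed-1')  -- deterministic pseudo-random sort key
limit 10000;
```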
