PostgreSQL: Splitting a dataset into training and test sets in Postgres

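I have a dataset that I want to split 70:30 into a training set and a test set using Postgres SQL. How can I do this? I tried the following code, but it does not seem to work: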

create table training_test as 
(
WITH TEMP as
(
  SELECT  ROW_NUMBER() AS ROW_ID , Random() as RANDOM_VALUE,D.*
        FROM  analytics.model_data_discharge_v1  as D
       ORDER BY RANDOM_VALUE
)

SELECT 'Training',T.* FROM TEMP T
WHERE ROW_ID <= 493896*0.70
UNION
SELECT 'Test',T.* FROM TEMP T
WHERE ROW_ID > 493896*0.70
) distributed by(hospitalaccountrecord);
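
One likely reason the query above fails is that ROW_NUMBER() is a window function and must be followed by an OVER (...) clause. A minimal sketch of a corrected version follows. The names row_id, total_rows and split_label are illustrative and assume the source table has no columns with those names. COUNT(*) OVER () replaces the hard-coded row count, and UNION is avoided because it would also remove duplicate rows.

```sql
-- Sketch only: row_id, total_rows and split_label are illustrative names.
CREATE TABLE training_test AS
WITH numbered AS (
    SELECT
        ROW_NUMBER() OVER (ORDER BY random()) AS row_id,     -- window functions need OVER (...)
        COUNT(*)     OVER ()                  AS total_rows, -- avoids hard-coding 493896
        d.*
    FROM analytics.model_data_discharge_v1 AS d
)
SELECT
    -- the first 70% of the randomly ordered rows are training, the rest are test
    CASE WHEN row_id <= total_rows * 0.70 THEN 'Training' ELSE 'Test' END AS split_label,
    numbered.*
FROM numbered;
```

DISTRIBUTED BY is Greenplum syntax rather than plain PostgreSQL; on Greenplum it can be appended after the final SELECT as in the original query.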

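If you need a stratified split, you can use the following code.

The first part guarantees that every group has the minimum size required for it to be split.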

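-- Keep only the groups that are large enough to be split.
-- {{ MINIMUM GROUP SIZE }} = 1 / {{ TEST_THRESHOLD }}
with ssize as (
    select
        "group"  -- "group" is a reserved word in Postgres, so it must be quoted
    from to_split_table
    group by "group"
    having count(*) >= {{ MINIMUM GROUP SIZE }})
select
    id_aux,
    ts."group",
    case
        when
        -- position of the row within its randomly ordered group, as a fraction of the group size
        cast(row_number() over (partition by ts."group" order by random()) as double precision)
            / cast(count(*) over (partition by ts."group") as double precision)
        < {{ TEST_THRESHOLD }} then 'test'
        else 'train'
    end as splitting
from to_split_table ts
join ssize
on ts."group" = ssize."group"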

Do not use random splitting, because it is not repeatable! random() returns a different result every time.

Instead, you can split the dataset using a hash and a modulo, as Google Cloud suggests:

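1. Hash a column/feature that is not correlated with the target, so that no valuable information is kept out of the training set. Otherwise, concatenate all fields into a JSON string and hash that string.
2. Take the absolute value of the hash.
3. Compute that value modulo 10.
4. If the result is < 8, the row becomes part of the 80% training set.
5. If the result is >= 8 (that is, 8 or 9), the row becomes part of the 20% test set.

An example using BigQuery, which I took from the GCP ML course: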

Training set

Test set


This way, you get the same 80% of the data every time.
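
In PostgreSQL, the same hash-and-modulo steps could be sketched roughly as follows. This assumes that hospitalaccountrecord (taken from the question) is a stable key that is not correlated with the target. The names bucket and split_label are illustrative. The cast of a hex string to bit(32) is a widely used idiom rather than a formally documented conversion, so treat it as an assumption and check it on your version.

```sql
-- Sketch only: deterministic 80/20 split based on a hash of a key column.
WITH hashed AS (
    SELECT
        d.*,
        -- first 8 hex digits of md5 -> 32-bit integer; widen to bigint before abs()
        -- so the most negative int value cannot overflow, then take modulo 10
        abs((('x' || substr(md5(d.hospitalaccountrecord::text), 1, 8))::bit(32)::int)::bigint) % 10 AS bucket
    FROM analytics.model_data_discharge_v1 AS d
)
SELECT
    -- buckets 0-7 are training (about 80%), buckets 8-9 are test (about 20%)
    CASE WHEN bucket < 8 THEN 'Training' ELSE 'Test' END AS split_label,
    hashed.*
FROM hashed;
```

Because the bucket depends only on the key value, a given row lands in the same set on every run. For the 70:30 split asked about in the question, the condition would be bucket < 7.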
