PostgreSQL: splitting a dataset into training and test sets in Postgres
I have a dataset that I want to split into a 70:30 ratio of training and test sets using Postgres SQL. How do I do this? I used the following code, but it doesn't seem to work:
create table training_test as
(
    with temp as
    (
        -- row_number() requires an OVER clause; ordering it by random()
        -- shuffles the rows in one step
        select row_number() over (order by random()) as row_id,
               d.*
        from analytics.model_data_discharge_v1 as d
    )
    select 'Training' as split, t.* from temp t
    where row_id <= 493896 * 0.70   -- 493896 = total row count of the table
    union all                       -- union all: a plain union would deduplicate rows
    select 'Test' as split, t.* from temp t
    where row_id > 493896 * 0.70
) distributed by (hospitalaccountrecord);  -- Greenplum clause; drop it on plain Postgres
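The 70:30 cutoff arithmetic can be sanity-checked outside the database. A minimal Python sketch of the same row-number rule (`assign_split` is a hypothetical helper, not part of the original query):

```python
def assign_split(row_id, total_rows, train_frac=0.70):
    # Rows whose shuffled position falls within the first train_frac
    # share go to training, mirroring "row_id <= total * 0.70" in the SQL.
    return "Training" if row_id <= total_rows * train_frac else "Test"

labels = [assign_split(i, 10) for i in range(1, 11)]
# 7 of 10 rows land in training, giving a 70:30 split
```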
If you need a stratified split, you can use the following code. The first CTE guarantees that every group has the minimum size required for splitting:
with ssize as (
    select
        "group"                                  -- group is a reserved word, so quote it
    from to_split_table
    group by "group"
    having count(*) >= {{ MINIMUM GROUP SIZE }}  -- {{ MINIMUM GROUP SIZE }} = 1 / {{ TEST_THRESHOLD }}
)
select
    id_aux,
    ts."group",
    case
        when cast(row_number() over (partition by ts."group" order by random()) as double precision)
             / cast(count(*) over (partition by ts."group") as double precision)
             < {{ TEST_THRESHOLD }} then 'test'
        else 'train'
    end as splitting
from to_split_table ts
join ssize
  on ts."group" = ssize."group"
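To see what the per-group rule does, here is a Python sketch of the same logic (`stratified_split` and its parameter names are assumptions for illustration, not part of the answer's SQL):

```python
import random
from collections import defaultdict

def stratified_split(rows, test_threshold, min_group_size, seed=0):
    """rows: list of (id_aux, group) pairs. Returns {id_aux: 'test'|'train'},
    dropping groups smaller than min_group_size, mirroring the SQL:
    row_number() / count() within each group < test_threshold -> 'test'."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for id_aux, grp in rows:
        groups[grp].append(id_aux)
    out = {}
    for grp, ids in groups.items():
        if len(ids) < min_group_size:
            continue            # the ssize CTE: group too small to split
        rng.shuffle(ids)        # order by random()
        for pos, id_aux in enumerate(ids, start=1):
            out[id_aux] = "test" if pos / len(ids) < test_threshold else "train"
    return out
```

With a seeded generator the split is repeatable, which is exactly what `order by random()` in SQL does not give you.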
A random split is not reproducible: random() returns different results on every run. Instead you can, as Google Cloud recommends, split the dataset using a hash and a modulo:

- Hash a column/feature that is uncorrelated with the target, so that you don't keep valuable signal out of the training set. Otherwise, concatenate all fields into a JSON string and hash that string.
- Take the absolute value of the hash.
- Compute the hash modulo 10.
- If the result is < 8, the row becomes part of the 80% training set.
- If the result is == 8, it becomes part of the 20% test set.

The GCP ML course shows BigQuery examples of this for both the training set and the test set. This way you get exactly the same 80% of the data every time.
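Since the BigQuery queries themselves are not reproduced here, below is a minimal Python sketch of the hash-and-modulo rule. Two assumptions to note: md5 stands in for BigQuery's FARM_FINGERPRINT, and because "result == 8" selects only one bucket of ten, this sketch sends every non-training bucket to the test set so the split comes out 80:20:

```python
import hashlib

def hash_split(key, train_buckets=8, n_buckets=10):
    # Deterministic split: hash the key, take the absolute value,
    # then bucket by modulo. md5 is a stand-in for FARM_FINGERPRINT.
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return "train" if abs(h) % n_buckets < train_buckets else "test"
```

Unlike random(), the same key always lands in the same set, so the split is reproducible across runs and across machines.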