Machine learning: how do I split a training dataset into train, validation, and test sets?
I have a custom image dataset with its targets, and I have created a Dataset for it in PyTorch. I want to split it into three parts: train, validation, and test. How can I do that?

Once you have the "master" dataset, you can use it to perform the split. Here is an example of a random split:
import torch
from torch.utils import data
import random
master = data.Dataset( ... ) # your "master" dataset
n = len(master) # how many total elements you have
n_test = int( n * .05 ) # number of test/val elements
n_train = n - 2 * n_test
idx = list(range(n)) # indices to all elements
random.shuffle(idx) # in-place shuffle the indices to facilitate random splitting
train_idx = idx[:n_train]
val_idx = idx[n_train:(n_train + n_test)]
test_idx = idx[(n_train + n_test):]
train_set = data.Subset(master, train_idx)
val_set = data.Subset(master, val_idx)
test_set = data.Subset(master, test_idx)
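The `Subset` objects produced above behave like any other `Dataset`, so they plug straight into `DataLoader`. Below is a self-contained sketch of the same split; the small `TensorDataset` of random tensors and the batch size are illustrative stand-ins for the real image dataset:

```python
import random
import torch
from torch.utils.data import TensorDataset, Subset, DataLoader

# A small TensorDataset stands in for the custom image "master" dataset:
# 100 fake 3x8x8 "images" with integer class targets.
master = TensorDataset(torch.randn(100, 3, 8, 8), torch.randint(0, 10, (100,)))

n = len(master)                # 100 total elements
n_test = int(n * .05)          # 5 elements each for val and test
n_train = n - 2 * n_test       # 90 elements for training

idx = list(range(n))           # indices to all elements
random.shuffle(idx)            # shuffle in place for a random split
train_set = Subset(master, idx[:n_train])
val_set = Subset(master, idx[n_train:n_train + n_test])
test_set = Subset(master, idx[n_train + n_test:])

# Subsets are Datasets, so DataLoader accepts them directly.
train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
print(len(train_set), len(val_set), len(test_set))  # 90 5 5
```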
This can also be done with the following function. Given the argument train_frac=0.8, it splits the dataset 80% / 10% / 10%:
import torch
from torch.utils.data import random_split

def dataset_split(dataset, train_frac):
    '''
    param dataset: Dataset object to be split
    param train_frac: ratio of train set to whole dataset

    Randomly splits the dataset into a dictionary with keys, based on these ratios:
        'train': train_frac
        'valid': (1 - train_frac) / 2
        'test':  (1 - train_frac) / 2
    '''
    assert 0 <= train_frac <= 1, "Invalid training set fraction"
    length = len(dataset)
    # int() floors, which favours allocation to the smaller valid and test sets
    train_length = int(length * train_frac)
    valid_length = int((length - train_length) / 2)
    test_length = length - train_length - valid_length
    splits = random_split(dataset, (train_length, valid_length, test_length))
    return {name: split for name, split in zip(('train', 'valid', 'test'), splits)}
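For reference, the same 80/10/10 split can be expressed in a few lines with `torch.utils.data.random_split` alone. A minimal self-contained sketch, again using a `TensorDataset` of dummy values as a stand-in for the real image dataset:

```python
import torch
from torch.utils.data import TensorDataset, random_split

# TensorDataset stand-in for the real image dataset: 100 single-feature rows.
dataset = TensorDataset(torch.arange(100.0).unsqueeze(1))

train_frac = 0.8
length = len(dataset)
train_length = int(length * train_frac)              # 80
valid_length = (length - train_length) // 2          # 10
test_length = length - train_length - valid_length   # 10

# random_split returns one Subset per requested length.
splits = random_split(dataset, (train_length, valid_length, test_length))
named = dict(zip(('train', 'valid', 'test'), splits))
print({k: len(v) for k, v in named.items()})  # {'train': 80, 'valid': 10, 'test': 10}
```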