System hangs after the first training epoch in PyTorch
So, I am trying to train a ResNet model in PyTorch, using the ImageNet example from the GitHub repository. Here is what my training method looks like (it is almost identical to the example).

System information:
GPU: Nvidia Titan XP
RAM: 32 GB
PyTorch: 0.4.0

When I run this code, training starts at epoch 0:
Epoch: [0][0/108] Time 5.644 (5.644) Data 1.929 (1.929) Loss 6.9052 (6.9052) Prec@1 0.000 (0.000)
and then the remote server disconnects automatically. This has happened five times now.
Here are the data loaders:
# Load the data --> TRAIN
traindir = 'train'
normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225])
train_dataset = datasets.ImageFolder(traindir, transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    normalize,
]))
train_loader = torch.utils.data.DataLoader(
    train_dataset,
    batch_size=args.batch_size, shuffle=True, num_workers=args.num_workers,
    pin_memory=cuda
)
# Load the data --> Validation
valdir = 'valid'
valid_loader = torch.utils.data.DataLoader(
    datasets.ImageFolder(valdir, transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        normalize,
    ])),
    batch_size=args.batch_size, shuffle=False, num_workers=args.num_workers,
    pin_memory=cuda
)
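As a quick sanity check (a minimal sketch of my own, not part of the original code), the same loader settings can be exercised against a synthetic `TensorDataset`, so it runs without the ImageNet folders. Setting `num_workers=0` here helps rule out worker-process problems when debugging a hang:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Synthetic stand-in for ImageFolder: 64 fake RGB "images" with random labels.
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 10, (64,))
dataset = TensorDataset(images, labels)

# Same settings as the training loader above, but with num_workers=0
# so no worker subprocesses are involved.
loader = DataLoader(dataset, batch_size=32, shuffle=True,
                    num_workers=0, pin_memory=torch.cuda.is_available())

for step, (x, y) in enumerate(loader):
    print(step, x.shape, y.shape)
```

If this loop completes but the real loaders hang with `num_workers > 0`, the problem is likely in the worker processes rather than in the transforms or the model.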
if args.evaluate:
    validate(valid_loader, model, criterion, epoch=0)
    return

# Start
for epoch in range(args.start_epoch, args.epochs):
    adjust_learning_rate(optimizer, epoch)
    # train for one epoch
    train(train_loader, model, criterion, optimizer, epoch)
    # evaluate on the validation set
    prec1 = validate(valid_loader, model, criterion, epoch)
    # remember the best prec@1 and save a checkpoint
    is_best = prec1 > best_prec1
    best_prec1 = max(prec1, best_prec1)
    save_checkpoint({
        'epoch': epoch + 1,
        'arch': args.arch,
        'state_dict': model.state_dict(),
        'best_prec1': best_prec1,
        'optimizer': optimizer.state_dict()
    }, is_best)
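For completeness, `adjust_learning_rate` and `save_checkpoint` come from the same ImageNet example; a sketch along those lines (the 10x decay every 30 epochs and the `checkpoint.pth.tar` / `model_best.pth.tar` filenames follow that example, so treat the exact values as assumptions):

```python
import shutil
import torch

def adjust_learning_rate(optimizer, epoch, base_lr=0.1):
    # Decay the learning rate by 10x every 30 epochs, as in the ImageNet example.
    lr = base_lr * (0.1 ** (epoch // 30))
    for param_group in optimizer.param_groups:
        param_group['lr'] = lr
    return lr

def save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):
    # Always save the latest state; copy it aside when it is the best so far.
    torch.save(state, filename)
    if is_best:
        shutil.copyfile(filename, 'model_best.pth.tar')
```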
The loaders are created with the following arguments:
args.num_workers = 4
args.batch_size = 32
pin_memory = torch.cuda.is_available()
Is there anything wrong with my approach?

It seems there is a bug in PyTorch's data loader. Try args.num_workers = 0:
args.num_workers = 0
args.batch_size = 32
pin_memory = torch.cuda.is_available()