model.train() and model.eval() produce NaN values
Hey, so I'm trying my hand at image classification / transfer learning using the monkey species dataset and resnet50, with a modified final fc layer that predicts just 10 classes. Everything worked fine until I started using model.train() and model.eval(). Now, after the first epoch, it starts returning NaNs and the accuracy drops. I'm curious why this only happens when switching between train and eval. First, I import the model, attach the classifier, and freeze the parameters:
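%%capture
from collections import OrderedDict
import torch.nn as nn
from torchvision import models

resnet = models.resnet50(pretrained=True)
for param in resnet.parameters():
    param.required_grad = False  # as posted; note that the PyTorch attribute is spelled requires_grad
in_features = resnet.fc.in_features
# Build custom classifier
classifier = nn.Sequential(OrderedDict([
    ('fc1', nn.Linear(in_features, 512)),
    ('relu', nn.ReLU()),
    ('drop', nn.Dropout(0.05)),
    ('fc2', nn.Linear(512, 10)),
]))
# ('output', nn.LogSoftmax(dim=1))
resnet.classifier = classifier  # as posted; torchvision's ResNet forward() calls resnet.fc
resnet.to(device)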
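Two details in the setup above are worth checking before looking at the loop. Assigning `required_grad` does not raise an error; it just attaches a new Python attribute to the tensor, so nothing is frozen. And torchvision's ResNet calls `self.fc` in its forward pass, so a module stored under `resnet.classifier` is registered but never used. The sketch below illustrates both effects on a hypothetical `TinyBackbone` stand-in (an assumption made so that the example runs without downloading weights), not on resnet50 itself.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Hypothetical stand-in for resnet50: features followed by a head named `fc`."""

    def __init__(self):
        super().__init__()
        self.features = nn.Linear(8, 16)
        self.fc = nn.Linear(16, 1000)

    def forward(self, x):
        return self.fc(torch.relu(self.features(x)))


model = TinyBackbone()

# Typo: this only attaches a new Python attribute; autograd ignores it.
for p in model.parameters():
    p.required_grad = False
still_trainable = all(p.requires_grad for p in model.parameters())

# Correct spelling: the existing parameters are now frozen.
for p in model.parameters():
    p.requires_grad = False

# Assigning to an attribute that forward() never uses leaves the output unchanged.
model.classifier = nn.Linear(16, 10)
shape_with_classifier = tuple(model(torch.randn(4, 8)).shape)

# Replacing the attribute that forward() does use swaps the head.
model.fc = nn.Linear(16, 10)  # newly created layers default to requires_grad=True
shape_with_fc = tuple(model(torch.randn(4, 8)).shape)

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(still_trainable, shape_with_classifier, shape_with_fc, trainable)
```

If the same holds for the real model, the network in the question is still emitting 1000 logits and every layer is being updated at lr=0.01, which may be relevant to the NaNs.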
Then I set up my loss function, optimizer, and scheduler:
# Step : Define criterion and optimizer
criterion = nn.CrossEntropyLoss()
# pass the optimizer to the appended classifier layer
optimizer = torch.optim.SGD(resnet.parameters(), lr=0.01)
# Scheduler
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.05)
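The comment above says the optimizer receives the appended classifier layer, while the code passes `resnet.parameters()`. If only the new head is meant to train, one option is to hand the optimizer just that head's parameters. The sketch below does this with a hypothetical `head` layer standing in for the classifier, and also records what `MultiStepLR` with `milestones=[10]` and `gamma=0.05` does to the learning rate.

```python
import torch
import torch.nn as nn

head = nn.Linear(512, 10)  # hypothetical stand-in for the new classifier head

# Only the head's parameters are handed to the optimizer.
optimizer = torch.optim.SGD(head.parameters(), lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.05)

lrs = []
for epoch in range(12):
    lrs.append(optimizer.param_groups[0]['lr'])  # learning rate used during this epoch
    optimizer.step()   # a real loop would run forward/backward before this
    scheduler.step()   # advance the schedule once per epoch

print(lrs[9], lrs[10])
```

The rate stays at 0.01 for the first ten epochs and is multiplied by 0.05 from the eleventh onward, so the schedule cannot help with instability that shows up right after the first epoch.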
Then I set up the training and validation loops:
epochs = 20
tr_losses = []
avg_epoch_tr_loss = []
tr_accuracy = []
val_losses = []
avg_epoch_val_loss = []
val_accuracy = []
val_loss_min = np.inf
resnet.train()
for epoch in range(epochs):
    for i, batch in enumerate(train_loader):
        # Pull the data and labels from the batch
        data, label = batch
        # If available, push data and label to the GPU
        if train_on_gpu:
            data, label = data.to(device), label.to(device)
        # Compute the logits
        logit = resnet(data)
        # Compute the loss
        loss = criterion(logit, label)
        # Clear the gradients
        resnet.zero_grad()
        # Backpropagate (accumulate the partial derivatives of the loss)
        loss.backward()
        # Update the parameters by stepping in the direction opposite to the gradient
        optimizer.step()
        # Store the loss of each batch
        # loss.item() returns a Python float detached from the computation graph
        tr_losses.append(loss.item())
        # Store the average accuracy of each batch
        tr_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean())
        # Print the rolling batch training loss every 40 batches
        if i % 40 == 0 and not i == 1:
            print(f'Batch No: {i} \tAverage Training Batch Loss: {torch.tensor(tr_losses).mean():.2f}')
    # Print the average loss for each epoch
    print(f'\nEpoch No: {epoch + 1},Training Loss: {torch.tensor(tr_losses).mean():.2f}')
    # Print the average accuracy for each epoch
    print(f'Epoch No: {epoch + 1}, Training Accuracy: {torch.tensor(tr_accuracy).mean():.2f}\n')
    # Store the avg epoch loss for plotting
    avg_epoch_tr_loss.append(torch.tensor(tr_losses).mean())
    resnet.eval()
    for i, batch in enumerate(val_loader):
        # Pull the data and labels from the batch
        data, label = batch
        # If available, push data and label to the GPU
        if train_on_gpu:
            data, label = data.to(device), label.to(device)
        # Compute the logits without computing the gradients
        with torch.no_grad():
            logit = resnet(data)
        # Compute the loss
        loss = criterion(logit, label)
        # Store the validation loss
        val_losses.append(loss.item())
        # Store the accuracy for each batch
        val_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean())
        if i % 20 == 0 and not i == 1:
            print(f'Batch No: {i+1} \tAverage Val Batch Loss: {torch.tensor(val_losses).mean():.2f}')
    # Print the average loss for each epoch
    print(f'\nEpoch No: {epoch + 1}, Epoch Val Loss: {torch.tensor(val_losses).mean():.2f}')
    # Print the average accuracy for each epoch
    print(f'Epoch No: {epoch + 1}, Epoch Val Accuracy: {torch.tensor(val_accuracy).mean():.2f}\n')
    # Store the avg epoch loss for plotting
    avg_epoch_val_loss.append(torch.tensor(val_losses).mean())
    # Checkpoint the model using a validation loss threshold
    if torch.tensor(val_losses).float().mean() <= val_loss_min:
        print("Epoch Val Loss Decreased... Saving model")
        # Save the current model
        torch.save(resnet.state_dict(), '/content/drive/MyDrive/1. Full Projects/Intel Image Classification/model_state.pt')
        val_loss_min = torch.tensor(val_losses).mean()
    # Step the scheduler for the next epoch
    scheduler.step()
    # Print the updated learning rate
    print('Learning Rate Set To: {:.5f}'.format(optimizer.state_dict()['param_groups'][0]['lr']),'\n')
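In the listing, `resnet.train()` appears once, before the epoch loop, while `resnet.eval()` is called inside it. If the code really runs in that order, the model is left in eval mode from the second epoch onward, so dropout is disabled and batch norm uses its running statistics during training. Whether that explains the NaNs is not established here, but it does line up with the "only after the first epoch" symptom and is cheap to rule out. A minimal sketch with a hypothetical toy model:

```python
import torch.nn as nn

# Hypothetical toy model containing the two layer types that depend on the mode.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4), nn.Dropout(0.5))

modes_seen = []
model.train()                            # called once, outside the loop
for epoch in range(2):
    modes_seen.append(model.training)    # mode in effect for this epoch's training pass
    model.eval()                         # validation switches the mode; nothing switches it back

fixed_modes = []
for epoch in range(2):
    model.train()                        # re-enable training behaviour at the start of every epoch
    fixed_modes.append(model.training)
    model.eval()

print(modes_seen, fixed_modes)
```

Moving the `train()` call to the top of the epoch loop body is the usual pattern when validation runs every epoch.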
I see that resnet.zero_grad() is called after logit = resnet(data), which causes the gradients to explode in your case.
Please do it as follows:
# Clearing the gradient
optimizer.zero_grad()
logit = resnet(data)
# Compute loss
loss = criterion(logit, label)
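One way to test whether the gradients are in fact exploding is to log the global gradient norm right after `loss.backward()` and to check the loss with `torch.isfinite`. The sketch below assumes a toy linear model and random data; `global_grad_norm` is a hypothetical helper written for this example, not a PyTorch function.

```python
import torch
import torch.nn as nn


def global_grad_norm(model):
    """L2 norm over all parameter gradients (hypothetical helper, not a PyTorch API)."""
    squares = [p.grad.detach().pow(2).sum() for p in model.parameters() if p.grad is not None]
    return torch.sqrt(torch.stack(squares).sum())


torch.manual_seed(0)
model = nn.Linear(3, 2)                  # toy model standing in for the network
criterion = nn.CrossEntropyLoss()
data, label = torch.randn(5, 3), torch.randint(0, 2, (5,))

loss = criterion(model(data), label)
loss.backward()

norm = global_grad_norm(model)
loss_ok = torch.isfinite(loss).item()    # False once the loss is nan or inf
grad_ok = torch.isfinite(norm).item()    # False once any gradient is nan or inf
print(f'loss={loss.item():.4f} grad_norm={norm.item():.4f} finite={loss_ok and grad_ok}')
```

Printing these values per batch in the real loop shows whether the norm grows steadily before the first NaN (pointing at the learning rate) or whether the NaN appears abruptly (pointing at the data or a specific layer).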
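Is it possible that there are nan values in your training samples? You might find this link useful.
@Shai no nans in the data folder. I'll check the link as well, thanks.
It might be that the transformations applied to the data create nans. Does the nan always appear at the same time/iteration/epoch? What happens if you significantly reduce the learning rate?
@Shai I see you're in Rehovot. I'm Irish but I live in Tel Aviv ;)
I'm not sure this is the issue here (according to the comments, it was a large learning rate).
Well, you should zero the gradients either before or after the forward pass. Think about it: he is calculating gradients in the forward pass... and he is clearing the gradients with .zero_grad()... does that make sense?
He does forward, zero grad, then backward, step. That looks just the same as zero grad, forward, backward, step.
Just lowering the lr helped, but I then turned the lr back up and put .zero_grad() before the forward pass, and both worked... no more nans. From my reading, it is safer to call it before the forward pass anyway...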
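The disagreement in the last few comments can be checked directly. The sketch below, which assumes a toy linear model and random batches, runs the same two training steps with `zero_grad()` before the forward pass, with it between forward and backward, and without it at all, then compares the gradient left on the weight. The forward pass only builds the graph; gradients are written during `backward()`, so the first two orderings should agree.

```python
import torch
import torch.nn as nn


def grads_after_two_steps(order):
    """Run two batches and return the weight gradient left after the second backward()."""
    torch.manual_seed(0)                 # same init and same batches for every ordering
    model = nn.Linear(3, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()
    batches = [(torch.randn(4, 3), torch.randint(0, 2, (4,))) for _ in range(2)]
    for data, label in batches:
        if order == "zero_first":
            optimizer.zero_grad()
            loss = criterion(model(data), label)
        elif order == "zero_after_forward":
            loss = criterion(model(data), label)
            optimizer.zero_grad()
        else:                            # "never": gradients from both batches accumulate
            loss = criterion(model(data), label)
        loss.backward()
        optimizer.step()
    return model.weight.grad.clone()


g_first = grads_after_two_steps("zero_first")
g_after = grads_after_two_steps("zero_after_forward")
g_never = grads_after_two_steps("never")

same = torch.allclose(g_first, g_after)
differs = not torch.allclose(g_first, g_never)
print(same, differs)
```

On this toy setup the two orderings give identical gradients, and only skipping `zero_grad()` changes them. `resnet.zero_grad()` and `optimizer.zero_grad()` clear the same tensors when the optimizer holds all of the model's parameters.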