model.train() and model.eval() produce NaN values

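Hey, so I'm trying my hand at image classification / transfer learning using the monkey species dataset and resnet50, with a modified final fc layer that predicts just 10 classes. Everything worked fine until I started using model.train() and model.eval(); after the first epoch it starts returning NaNs and the accuracy drops off. I'm curious why this only happens when switching between train and eval.

First, I import the model, attach the classifier, and freeze the parameters: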

%%capture
resnet = models.resnet50(pretrained=True)

for param in resnet.parameters():
  param.required_grad = False

in_features = resnet.fc.in_features


# Build custom classifier
classifier = nn.Sequential(OrderedDict([('fc1', nn.Linear(in_features, 512)),
                                        ('relu', nn.ReLU()),
                                        ('drop', nn.Dropout(0.05)),
                                        ('fc2', nn.Linear(512, 10)),
                                        ]))

# ('output', nn.LogSoftmax(dim=1))
resnet.classifier = classifier

resnet.to(device)
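
A side note on the snippet above, as a minimal sketch with a tiny stand-in model (the class `TinyNet` and its layer sizes are made up for illustration; it is not torchvision's ResNet). Autograd reads the attribute `requires_grad`, so assigning `required_grad` only attaches an unused Python attribute and freezes nothing. Likewise, torchvision's ResNet calls its head `fc` in `forward`, so a module assigned to a new attribute such as `classifier` is registered but never invoked.

```python
import torch
from torch import nn


class TinyNet(nn.Module):
    """Hypothetical stand-in: a 'backbone' layer followed by a head named fc."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(8, 4)
        self.fc = nn.Linear(4, 1000)

    def forward(self, x):
        # Only self.fc is used as the head, mirroring how ResNet.forward uses fc
        return self.fc(torch.relu(self.backbone(x)))


model = TinyNet()

# Misspelled attribute: sets an unrelated attribute, parameters stay trainable
for param in model.parameters():
    param.required_grad = False
still_trainable = all(p.requires_grad for p in model.parameters())

# Correct attribute: this is what actually freezes the weights
for param in model.parameters():
    param.requires_grad = False

# Assigning to a new name registers a module that forward() never calls
model.classifier = nn.Linear(4, 10)
out_unused_head = model(torch.randn(2, 8)).shape

# Replacing the attribute forward() actually uses swaps the head
model.fc = nn.Linear(4, 10)  # newly created layers are trainable by default
out_new_head = model(torch.randn(2, 8)).shape

trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(still_trainable, tuple(out_unused_head), tuple(out_new_head), trainable)
```

If the same holds for the real model, the network in the question would be fine-tuning all of resnet50 and still emitting 1000 logits, which may matter when judging whether lr=0.01 is too high.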
Then I set up my loss function, optimizer, and scheduler:

# Step : Define criterion and optimizer
criterion = nn.CrossEntropyLoss()
# pass the optimizer to the appended classifier layer
optimizer = torch.optim.SGD(resnet.parameters(), lr=0.01)
# Scheduler
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.05)  
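
To illustrate what this configuration does, here is a minimal sketch with a made-up one-layer model (`head` is a hypothetical placeholder for the classifier). It passes only parameters with `requires_grad=True` to the optimizer, which is one common way to match the comment about optimizing just the appended classifier, and it steps `MultiStepLR` once per epoch to show the learning rate falling from 0.01 to 0.0005 at epoch 10.

```python
import torch
from torch import nn

head = nn.Linear(4, 10)  # hypothetical stand-in for the classifier head

# Hand the optimizer only the parameters that are meant to be trained
trainable_params = [p for p in head.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable_params, lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.05)

lrs = []
for epoch in range(20):
    lrs.append(optimizer.param_groups[0]["lr"])  # lr in effect during this epoch
    # (the training batches for this epoch would run here)
    optimizer.step()   # called before scheduler.step(), the order PyTorch expects
    scheduler.step()

print(lrs[0], lrs[9], lrs[10], lrs[19])
```

With these settings the first ten epochs all run at the initial rate, so an instability that shows up after epoch 1 is not something the scheduler would have had a chance to dampen.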
Then I set up the training and validation loops:

epochs = 20


tr_losses = []
avg_epoch_tr_loss = []
tr_accuracy = []


val_losses = []
avg_epoch_val_loss = []
val_accuracy = []
val_loss_min = np.Inf


resnet.train()
for epoch in range(epochs):
  for i, batch in enumerate(train_loader):
    # Pull the data and labels from the batch
    data, label = batch
    # If available push data and label to GPU
    if train_on_gpu:
      data, label = data.to(device), label.to(device)
    # Compute the logit
    logit = resnet(data)
    # Compute loss
    loss = criterion(logit, label)
    # Clearing the gradient
    resnet.zero_grad()
    # Backpropagate the gradients (accumulate the partial derivatives of loss)
    loss.backward()
    # Apply the updates to the optimizer step in the opposite direction to the gradient
    optimizer.step()
    # Store the losses of each batch
    # loss.item() separates the loss from comp graph
    tr_losses.append(loss.item())
    # Detach and store the average accuracy of each batch
    tr_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean())
    # Print the rolling batch training loss every 40 batches
    if i % 40 == 0 and not i == 1:
      print(f'Batch No: {i} \tAverage Training Batch Loss: {torch.tensor(tr_losses).mean():.2f}')
  # Print the average loss for each epoch
  print(f'\nEpoch No: {epoch + 1},Training Loss: {torch.tensor(tr_losses).mean():.2f}')
  # Print the average accuracy for each epoch
  print(f'Epoch No: {epoch + 1}, Training Accuracy: {torch.tensor(tr_accuracy).mean():.2f}\n')
  # Store the avg epoch loss for plotting
  avg_epoch_tr_loss.append(torch.tensor(tr_losses).mean())


  resnet.eval()
  for i, batch in enumerate(val_loader):
    # Pull the data and labels from the batch
    data, label = batch
    # If available push data and label to GPU
    if train_on_gpu:
      data, label = data.to(device), label.to(device)
    # Compute the logits without computing the gradients
    with torch.no_grad():
      logit = resnet(data)
    # Compute loss
    loss = criterion(logit, label)
    # Store test loss
    val_losses.append(loss.item())
    # Store the accuracy for each batch
    val_accuracy.append(label.eq(logit.argmax(dim=1)).float().mean())
    if i % 20 == 0 and not i == 1:
      print(f'Batch No: {i+1} \tAverage Val Batch Loss: {torch.tensor(val_losses).mean():.2f}')
  # Print the average loss for each epoch
  print(f'\nEpoch No: {epoch + 1}, Epoch Val Loss: {torch.tensor(val_losses).mean():.2f}')
  # Print the average accuracy for each epoch    
  print(f'Epoch No: {epoch + 1}, Epoch Val Accuracy: {torch.tensor(val_accuracy).mean():.2f}\n')
  # Store the avg epoch loss for plotting
  avg_epoch_val_loss.append(torch.tensor(val_losses).mean())

  # Checkpointing the model using val loss threshold
  if torch.tensor(val_losses).float().mean() <= val_loss_min:
    print("Epoch Val Loss Decreased... Saving model")
    # save current model
    torch.save(resnet.state_dict(), '/content/drive/MyDrive/1. Full Projects/Intel Image Classification/model_state.pt')
    val_loss_min = torch.tensor(val_losses).mean()
  # Step the scheduler for the next epoch
  scheduler.step()
  # Print the updated learning rate
  print('Learning Rate Set To: {:.5f}'.format(optimizer.state_dict()['param_groups'][0]['lr']),'\n')
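
Two things stand out in the loop as posted: `resnet.train()` is called once, before the epoch loop, so after the first `resnet.eval()` every later epoch trains in eval mode; and the loss and accuracy lists are never reset, so the printed "epoch" averages are cumulative. The following is a minimal sketch of a per-epoch structure that avoids both, using a made-up small model and random tensors in place of the real network and loaders (all names and sizes here are illustrative assumptions).

```python
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical stand-ins for the real model and data loaders
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.05), nn.Linear(16, 10))
train_batches = [(torch.randn(4, 8), torch.randint(0, 10, (4,))) for _ in range(3)]
val_batches = [(torch.randn(4, 8), torch.randint(0, 10, (4,))) for _ in range(2)]

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

modes_seen = []        # records model.training at the start of each phase
epoch_val_losses = []

for epoch in range(2):
    model.train()      # switch back to training mode at the start of every epoch
    modes_seen.append(model.training)
    for data, label in train_batches:
        optimizer.zero_grad()
        loss = criterion(model(data), label)
        loss.backward()
        optimizer.step()

    model.eval()       # inference behaviour for dropout / batch norm
    modes_seen.append(model.training)
    val_losses = []    # reset per epoch so the mean is not cumulative
    with torch.no_grad():
        for data, label in val_batches:
            val_losses.append(criterion(model(data), label).item())
    epoch_val_losses.append(sum(val_losses) / len(val_losses))

print(modes_seen, epoch_val_losses)
```

Whether the mode switch is what triggers the NaNs in the original run is not established here; the sketch only shows the structure that keeps train and eval phases separate.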

I see that resnet.zero_grad() is called after logit = resnet(data), which causes the gradients to explode in your case.

Please do it as follows:

# Clearing the gradient
optimizer.zero_grad()
logit = resnet(data)

# Compute loss
loss = criterion(logit, label)

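Could there be nan values in your training samples? You may find some useful information in this link.

@Shai no nans in the data folder. I'll check the link too, thanks.

Maybe the transforms applied to the data are creating nans.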
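
One way to test that hypothesis is to scan what the loader yields before it reaches the model. A minimal sketch, assuming batches are `(data, label)` tuples of tensors; the helper name `find_non_finite_batches` and the fake batches are illustrative, not from the original post.

```python
import torch


def find_non_finite_batches(batches):
    """Return the indices of batches whose data contains NaN or +/-Inf."""
    bad = []
    for i, (data, _label) in enumerate(batches):
        if not torch.isfinite(data).all():
            bad.append(i)
    return bad


# Fake batches standing in for a real DataLoader; batch 1 is corrupted on purpose
clean = torch.randn(4, 3, 8, 8)
corrupt = clean.clone()
corrupt[0, 0, 0, 0] = float("nan")
batches = [
    (clean, torch.zeros(4, dtype=torch.long)),
    (corrupt, torch.zeros(4, dtype=torch.long)),
]

bad_batches = find_non_finite_batches(batches)
print(bad_batches)
```

Running the same check over `train_loader` after the transforms are applied would show whether the inputs are the source. If they are clean, `torch.autograd.set_detect_anomaly(True)` can help locate the backward operation that first produces a NaN, at the cost of slower training.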
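Does the nan always appear at the same time/iteration/epoch? What happens if you significantly lower the learning rate?

@Shai I see you're in Rehovot. I'm Irish but I live in Tel Aviv ;)

I'm not sure that's the issue here (judging by the comments, it's the large learning rate).

Well, you should zero the gradients before or after the forward pass. Think about it: he computes the gradients during the forward pass... and he clears the gradients with .zero_grad()... does that make sense?

He does forward, zero grad, then backward, step. That looks the same as zero grad, forward, backward, step.

Just lowering the lr helped, but then I turned the lr back up and moved .zero_grad() before the forward pass, and both worked... no more NaNs. From what I've read, it's safer to call it before the forward pass anyway...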
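
The ordering question debated above can be checked directly. A minimal sketch with a made-up linear model and random data: `zero_grad()` only resets the stored `.grad` tensors and does not touch the graph built by the forward pass, so zeroing before the forward pass or between forward and backward should leave the same gradients, provided it happens before `backward()`.

```python
import torch
from torch import nn

torch.manual_seed(0)
data = torch.randn(4, 8)
label = torch.randint(0, 3, (4,))
criterion = nn.CrossEntropyLoss()


def grads_after_two_steps(zero_before_forward):
    torch.manual_seed(1)  # identical initial weights for both orderings
    model = nn.Linear(8, 3)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for _ in range(2):
        if zero_before_forward:
            optimizer.zero_grad()
            loss = criterion(model(data), label)
        else:
            loss = criterion(model(data), label)
            optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return [p.grad.clone() for p in model.parameters()]


grads_a = grads_after_two_steps(zero_before_forward=True)
grads_b = grads_after_two_steps(zero_before_forward=False)
same = all(torch.allclose(a, b) for a, b in zip(grads_a, grads_b))
print(same)
```

On this toy setup the two orderings agree, which is consistent with the comments pointing at the learning rate. Calling `zero_grad()` at the top of the step is still the conventional placement because it is harder to get wrong.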