Deep learning PyTorch MNIST example does not converge

I am new to PyTorch and I am writing a toy example that performs MNIST classification. Here is the complete code of my example:

import matplotlib
matplotlib.use("Agg")
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader

import torchvision.transforms as transforms
import torchvision.datasets as datasets

import matplotlib.pyplot as plt
import os
from os import system, listdir
from os.path import join, isfile, isdir, dirname

def img_transform(image):
    transform=transforms.Compose([
        # transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))])
    return transform(image)


def normalize_output(img):
    img = img - img.min()
    img = img / img.max()
    return img

def save_checkpoint(state, filename='checkpoint.pth.tar'):
    torch.save(state, filename)

class Net(nn.Module):
    """docstring for Net"""
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.max_pool2d(x, 2)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output

os.environ['CUDA_VISIBLE_DEVICES'] = '0'
data_images, data_labels = torch.load("./PATH/MNIST/processed/training.pt")
model = Net()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-2)
epochs = 5
batch_size = 30
num_batch = int(data_images.shape[0] / batch_size)
for epoch in range(epochs):
    for batch_idx in range(num_batch):
        data = data_images[ batch_idx*batch_size : (batch_idx+1)*batch_size ].float()
        label = data_labels[ batch_idx*batch_size : (batch_idx+1)*batch_size ]
        data = img_transform(data)
        data = data.unsqueeze_(1)
        pred_score = model(data)
        loss = criterion(pred_score, label)
        loss.backward()
        optimizer.step()
        if batch_idx % 200 == 0:
            print('epoch', epoch, batch_idx, '/', num_batch, 'loss', loss.item())
            _, pred = pred_score.topk(1)
            pred = pred.t().squeeze()
            correct = pred.eq(label)
            num_correct = correct.sum(0).item()
            print('acc=', num_correct/batch_size)

dict_to_save = {
    'epoch': epochs,
    'state_dict': model.state_dict(),
    'optimizer' : optimizer.state_dict(),
    }
ckpt_file = 'a.pth.tar'
save_checkpoint(dict_to_save, ckpt_file)
print('save to ckpt_file', ckpt_file)
exit()
The code can be run with the MNIST data saved at the path
/path/MNIST/processed/training.pt

However, the training process does not converge, and the training accuracy stays below 0.2. What is wrong with my implementation? I have tried different learning rates and batch sizes, but it does not work.

Are there any other problems in my code?

Thanks, everyone, for helping me.

Here are some training logs:

epoch 0 0 / 2000 loss 27.2023868560791
acc= 0.1
epoch 0 200 / 2000 loss 2.3346288204193115
acc= 0.13333333333333333
epoch 0 400 / 2000 loss 2.691042900085449
acc= 0.13333333333333333
epoch 0 600 / 2000 loss 2.6452369689941406
acc= 0.06666666666666667
epoch 0 800 / 2000 loss 2.7910964488983154
acc= 0.13333333333333333
epoch 0 1000 / 2000 loss 2.966330051422119
acc= 0.1
epoch 0 1200 / 2000 loss 3.111387014389038
acc= 0.06666666666666667
epoch 0 1400 / 2000 loss 3.1988155841827393
acc= 0.03333333333333333

I found at least four issues that affect the results you are getting:

1) You need to zero the gradients before each backward pass, e.g.:

optimizer.zero_grad()
loss.backward()
optimizer.step()
2) You are feeding the output of F.log_softmax into nn.CrossEntropyLoss(). CrossEntropyLoss expects raw logits (it applies log_softmax internally), so remove this line:

output = F.log_softmax(x, dim=1)
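
For reference, a minimal sketch of the corrected forward method (keeping the layers from the question unchanged) that returns raw logits:

def forward(self, x):
    x = self.conv1(x)
    x = F.relu(x)
    x = self.conv2(x)
    x = F.max_pool2d(x, 2)
    x = torch.flatten(x, 1)      # (batch, 9216) for 28x28 MNIST inputs
    x = self.fc1(x)
    x = F.relu(x)
    return self.fc2(x)           # raw logits; nn.CrossEntropyLoss applies log_softmax itself
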
3) When printing, you only compute the loss and accuracy of the current batch, so these numbers are not representative. To fix this, store all losses/accuracies and compute the average before printing, e.g.:

# During the loop
loss_value += loss.item()

# When printing:
print(loss_value/number_of_batch_losses_stored)
4) This is not a big issue, but I think the learning rate should be lower, e.g. 1e-3.

As a tip to improve your pipeline, it is better to use a DataLoader to load the data; have a look at torch.utils.data to see how to do that. Loading batches the way you do now is not efficient because you are not using a generator. Also, MNIST is already available in torchvision.datasets.MNIST, so you can save some time by loading the data from there, as in the sketch below.


I hope this helps.

In your training loop, you should add model.train(). Also, try using NLLLoss.
Adding to your point 4): it is usually best practice to start with a lower learning rate; setting it too small can be just as problematic as setting it too large, but 1e-3 is generally considered a good rule of thumb. @Denninger Fully agree!
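
For completeness, a small sketch of the two equivalent loss setups mentioned in the comments (this is a general PyTorch fact, not code from the question):

# Option A: keep F.log_softmax(x, dim=1) at the end of forward
criterion = nn.NLLLoss()             # expects log-probabilities

# Option B: return raw logits from forward (no softmax at the end)
criterion = nn.CrossEntropyLoss()    # applies log_softmax internally

# Either way, call model.train() before the training loop and
# optimizer.zero_grad() before each loss.backward().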