Very low PyTorch CUDA performance on Windows and WSL2

Host environment:
CPU: Intel Core i7-7700HQ @ 2.8 GHz
RAM: 16 GB
GPU: NVIDIA GeForce GTX 1050 Ti
OS: 64-bit Windows Home 2004 (20241.1005)
CUDA: 11.2
Conda: 4.8.4
Python: 3.7.5
PyTorch: 1.5.1
torchtext: 0.6.0
cudatoolkit: 10.1
cuDNN: 7

WSL2 environment:
OS: Ubuntu 18.04
Conda: 4.8.4
Python: 3.7.5
PyTorch: 1.5.1
torchtext: 0.6.0
cudatoolkit: 10.1
cuDNN: 7

Docker environment:
Docker: 19.03.13
Conda: 4.8.4
Python: 3.7.5
PyTorch: 1.5.0
torchtext: 0.6.0
cudatoolkit: 10.1
cuDNN: 7
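
A side note on the version mismatch above: the host driver stack reports CUDA 11.2 while the conda cudatoolkit in all three environments is 10.1. That is normally fine, since conda-installed PyTorch ships its own CUDA runtime and only needs a new-enough driver, but it is worth confirming what each environment actually sees. A quick check using standard PyTorch calls (the printed values are whatever your install reports):

import torch

print(torch.__version__)                  # e.g. 1.5.1
print(torch.version.cuda)                 # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())    # cuDNN version in use
print(torch.cuda.is_available())         # True if a usable GPU is visible
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0)) # e.g. 'GeForce GTX 1050 Ti'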

I tested two PyTorch LeNet-5 training scripts in all three environments, one training on the CPU and the other on the GPU. I am confused that the GPU's performance is not very good, and that running on Windows is much slower than running in the virtual Linux and Docker environments. I cannot post screenshots here, so please check the timing logs:

Windows output:

  • cuda training
    epoch 1, loss 1.7694, train acc 0.353, test acc 0.581, time 7.8 sec
    epoch 2, loss 0.4776, train acc 0.628, test acc 0.687, time 6.7 sec
    epoch 3, loss 0.2609, train acc 0.711, test acc 0.717, time 6.7 sec
    epoch 4, loss 0.1738, train acc 0.738, test acc 0.743, time 6.6 sec
    epoch 5, loss 0.1276, train acc 0.754, test acc 0.756, time 6.7 sec

  • cpu training
    epoch 1, loss 1.8857, train acc 0.304, test acc 0.586, time 20.8 sec
    epoch 2, loss 0.4707, train acc 0.634, test acc 0.676, time 20.2 sec
    epoch 3, loss 0.2549, train acc 0.719, test acc 0.719, time 20.7 sec
    epoch 4, loss 0.1698, train acc 0.743, test acc 0.741, time 21.4 sec
    epoch 5, loss 0.1257, train acc 0.758, test acc 0.755, time 22.0 sec

In WSL2 Ubuntu:

  • cuda training
    epoch 1, loss 1.8913, train acc 0.305, test acc 0.574, time 10.7 sec
    epoch 2, loss 0.4851, train acc 0.619, test acc 0.665, time 10.3 sec
    epoch 3, loss 0.2610, train acc 0.710, test acc 0.729, time 9.9 sec
    epoch 4, loss 0.1728, train acc 0.737, test acc 0.743, time 10.3 sec
    epoch 5, loss 0.1285, train acc 0.751, test acc 0.749, time 10.2 sec

  • cpu training
    epoch 1, loss 1.8504, train acc 0.322, test acc 0.583, time 6.5 sec
    epoch 2, loss 0.4723, train acc 0.635, test acc 0.676, time 6.3 sec
    epoch 3, loss 0.2590, train acc 0.713, test acc 0.723, time 5.6 sec
    epoch 4, loss 0.1717, train acc 0.739, test acc 0.740, time 5.6 sec
    epoch 5, loss 0.1268, train acc 0.754, test acc 0.750, time 5.6 sec

In WSL Ubuntu Docker:

  • cuda training
    epoch 1, loss 1.8288, train acc 0.325, test acc 0.588, time 11.0 sec
    epoch 2, loss 0.4789, train acc 0.622, test acc 0.674, time 10.7 sec
    epoch 3, loss 0.2598, train acc 0.713, test acc 0.727, time 10.6 sec
    epoch 4, loss 0.1707, train acc 0.739, test acc 0.747, time 10.9 sec
    epoch 5, loss 0.1251, train acc 0.757, test acc 0.760, time 10.7 sec

  • cpu training
    epoch 1, loss 1.8938, train acc 0.302, test acc 0.561, time 5.8 sec
    epoch 2, loss 0.4748, train acc 0.641, test acc 0.682, time 5.8 sec
    epoch 3, loss 0.2510, train acc 0.718, test acc 0.728, time 5.7 sec
    epoch 4, loss 0.1673, train acc 0.742, test acc 0.731, time 5.8 sec
    epoch 5, loss 0.1236, train acc 0.758, test acc 0.757, time 6.1 sec
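
One caveat about these numbers: CUDA kernels launch asynchronously, and the per-epoch time in the logs also includes the test-set evaluation, so plain time.time() deltas mix GPU compute, data loading, and evaluation. When timing CUDA code it is safer to synchronize the device before reading the clock. A minimal sketch of the pattern (not the benchmark itself; the conv shapes here are arbitrary):

import time
import torch

device = torch.device('cuda')
x = torch.randn(256, 1, 28, 28, device=device)
w = torch.randn(6, 1, 5, 5, device=device)

torch.cuda.synchronize()              # drain any pending work first
start = time.time()
for _ in range(100):
    y = torch.nn.functional.conv2d(x, w)
torch.cuda.synchronize()              # wait for the queued kernels to finish
print('time %.3f sec' % (time.time() - start))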

The two Python scripts used for testing:

  • lenet5.py
  • lenet5-cpu.py

The only difference between the two scripts is line 11: in lenet5-cpu.py the device is fixed to cpu, as shown below.
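
For reference, line 11 in each script reads:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')  # lenet5.py
device = torch.device('cpu')                                           # lenet5-cpu.py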

Do you see the problem?

lenet5.py:
import os
import time
import torch
from torch import nn, optim

import sys
sys.path.append("..") 
import d2lzh_pytorch as d2l

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(torch.__version__)
print(device)


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2)
        )
        self.fc = nn.Sequential(
            nn.Linear(16*4*4, 120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

    def forward(self, img):
        feature = self.conv(img)
        output = self.fc(feature.view(img.shape[0], -1))
        return output

    
net = LeNet()
print(net)


batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)


def evaluate_accuracy(data_iter, net, device=None):
    if device is None and isinstance(net, torch.nn.Module):
        device = list(net.parameters())[0].device
    acc_sum, n = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(net, torch.nn.Module):
                net.eval() 
                acc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()
                net.train()
            else: 
                if('is_training' in net.__code__.co_varnames): 
                    acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item() 
                else:
                    acc_sum += (net(X).argmax(dim=1) == y).float().sum().item() 
            n += y.shape[0]
    return acc_sum / n


def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):
    net = net.to(device)
    print("training on ", device)
    loss = torch.nn.CrossEntropyLoss()
    batch_count = 0
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            train_l_sum += l.cpu().item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1
        test_acc = evaluate_accuracy(test_iter, net)
        # note: the printed epoch time also includes this test-set evaluation
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))


if __name__ == "__main__":
    lr, num_epochs = 0.001, 5
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

lenet5-cpu.py:

import os
import time
import torch
from torch import nn, optim

import sys
sys.path.append("..") 
import d2lzh_pytorch as d2l

os.environ["CUDA_VISIBLE_DEVICaES"] = "0"
device = torch.device('cpu')

print(torch.__version__)
print(device)


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, 5), # in_channels, out_channels, kernel_size
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2), # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2)
        )
        self.fc = nn.Sequential(
            nn.Linear(16*4*4, 120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

    def forward(self, img):
        feature = self.conv(img)
        output = self.fc(feature.view(img.shape[0], -1))
        return output

    
net = LeNet()
print(net)


batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)


def evaluate_accuracy(data_iter, net, device=None):
    if device is None and isinstance(net, torch.nn.Module):
        device = list(net.parameters())[0].device
    acc_sum, n = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(net, torch.nn.Module):
                net.eval() 
                acc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()
                net.train() 
            else: 
                if('is_training' in net.__code__.co_varnames): 
                    acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item() 
                else:
                    acc_sum += (net(X).argmax(dim=1) == y).float().sum().item() 
            n += y.shape[0]
    return acc_sum / n


def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):
    net = net.to(device)
    print("training on ", device)
    loss = torch.nn.CrossEntropyLoss()
    batch_count = 0
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            train_l_sum += l.cpu().item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))


if __name__ == "__main__":
    lr, num_epochs = 0.001, 5
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
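
For what it's worth, one pattern in the logs is that CPU training is three to four times slower on native Windows (20-22 sec/epoch) than inside WSL2 or Docker (~6 sec/epoch), while CUDA shows the opposite ordering. That asymmetry suggests looking at the input pipeline as well as the GPU: d2l.load_data_fashion_mnist builds a DataLoader, and on Windows multiprocessing workers are spawned rather than forked, so whatever num_workers setting d2lzh_pytorch picks can behave very differently across the three environments. A way to take data loading out of the picture entirely is to train on one fixed in-memory batch. A minimal sketch, assuming it runs after the definitions in lenet5.py (LeNet, device, time, and torch come from that script):

net = LeNet().to(device)
X = torch.randn(256, 1, 28, 28, device=device)   # one resident batch, no DataLoader
y = torch.randint(0, 10, (256,), device=device)
loss = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(net.parameters(), lr=0.001)

if device.type == 'cuda':
    torch.cuda.synchronize()
start = time.time()
for _ in range(235):                             # ~60000/256 steps, one epoch's worth
    optimizer.zero_grad()
    l = loss(net(X), y)
    l.backward()
    optimizer.step()
if device.type == 'cuda':
    torch.cuda.synchronize()
print('pure-compute epoch: %.1f sec' % (time.time() - start))

If this synthetic epoch is fast in every environment, the slowdown lives in the data pipeline rather than in CUDA itself.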