Very low CUDA performance of PyTorch in Windows and WSL2

Host environment:
CPU: Intel Core i7-7700HQ @ 2.8GHz
RAM: 16GB
GPU: NVidia GeForce 1050Ti
OS: 64-bit Windows Home 2004 (20241.1005)
CUDA: 11.2
Conda: 4.8.4
Python: 3.7.5
PyTorch: 1.5.1
torchtext: 0.6.0
cudatoolkit: 10.1
CuDNN: 7

WSL2 environment:
OS: Ubuntu 18.04
Conda: 4.8.4
Python: 3.7.5
PyTorch: 1.5.1
torchtext: 0.6.0
cudatoolkit: 10.1
CuDNN: 7

Docker environment:
Docker: 19.03.13
Conda: 4.8.4
Python: 3.7.5
PyTorch: 1.5.0
torchtext: 0.6.0
cudatoolkit: 10.1
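CuDNN: 7

I tested two PyTorch LeNet-5 training scripts in the 3 environments; one uses the CPU and the other uses the GPU. I am puzzled that the GPU performance is not very good, and that running on Windows is much slower than in virtual Linux and Docker. I can't post screenshots, so please check the timing logs below.

Windows output: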
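- training on cuda
epoch 1, loss 1.7694, train acc 0.353, test acc 0.581, time 7.8 sec
epoch 2, loss 0.4776, train acc 0.628, test acc 0.687, time 6.7 sec
epoch 3, loss 0.2609, train acc 0.711, test acc 0.717, time 6.7 sec
epoch 4, loss 0.1738, train acc 0.738, test acc 0.743, time 6.6 sec
epoch 5, loss 0.1276, train acc 0.754, test acc 0.756, time 6.7 sec
- training on cpu
epoch 1, loss 1.8857, train acc 0.304, test acc 0.586, time 20.8 sec
epoch 2, loss 0.4707, train acc 0.634, test acc 0.676, time 20.2 sec
epoch 3, loss 0.2549, train acc 0.719, test acc 0.719, time 20.7 sec
epoch 4, loss 0.1698, train acc 0.743, test acc 0.741, time 21.4 sec
epoch 5, loss 0.1257, train acc 0.758, test acc 0.755, time 22.0 sec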
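
WSL2 output:
- training on cuda
epoch 1, loss 1.8913, train acc 0.305, test acc 0.574, time 10.7 sec
epoch 2, loss 0.4851, train acc 0.619, test acc 0.665, time 10.3 sec
epoch 3, loss 0.2610, train acc 0.710, test acc 0.729, time 9.9 sec
epoch 4, loss 0.1728, train acc 0.737, test acc 0.743, time 10.3 sec
epoch 5, loss 0.1285, train acc 0.751, test acc 0.749, time 10.2 sec
- training on cpu
epoch 1, loss 1.8504, train acc 0.322, test acc 0.583, time 6.5 sec
epoch 2, loss 0.4723, train acc 0.635, test acc 0.676, time 6.3 sec
epoch 3, loss 0.2590, train acc 0.713, test acc 0.723, time 5.6 sec
epoch 4, loss 0.1717, train acc 0.739, test acc 0.740, time 5.6 sec
epoch 5, loss 0.1268, train acc 0.754, test acc 0.750, time 5.6 sec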
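
Docker output:
- training on cuda
epoch 1, loss 1.8288, train acc 0.325, test acc 0.588, time 11.0 sec
epoch 2, loss 0.4789, train acc 0.622, test acc 0.674, time 10.7 sec
epoch 3, loss 0.2598, train acc 0.713, test acc 0.727, time 10.6 sec
epoch 4, loss 0.1707, train acc 0.739, test acc 0.747, time 10.9 sec
epoch 5, loss 0.1251, train acc 0.757, test acc 0.760, time 10.7 sec
- training on cpu
epoch 1, loss 1.8938, train acc 0.302, test acc 0.561, time 5.8 sec
epoch 2, loss 0.4748, train acc 0.641, test acc 0.682, time 5.8 sec
epoch 3, loss 0.2510, train acc 0.718, test acc 0.728, time 5.7 sec
epoch 4, loss 0.1673, train acc 0.742, test acc 0.731, time 5.8 sec
epoch 5, loss 0.1236, train acc 0.758, test acc 0.757, time 6.1 sec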
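
To make the comparison easier to read, the per-epoch times in the logs can be averaged per environment and device. The following is only a rough sketch: the numbers are copied from the logs above, the names `EPOCH_TIMES` and `summarize` are made up for illustration, and labelling the second and third groups of logs as WSL2 and Docker is an assumption based on the order in which the environments are listed.

```python
from statistics import mean

# Per-epoch times in seconds, copied from the logs above.
# Assumption: the second and third log groups are WSL2 and Docker,
# following the order in which the environments are listed.
EPOCH_TIMES = {
    ("Windows", "cuda"): [7.8, 6.7, 6.7, 6.6, 6.7],
    ("Windows", "cpu"): [20.8, 20.2, 20.7, 21.4, 22.0],
    ("WSL2", "cuda"): [10.7, 10.3, 9.9, 10.3, 10.2],
    ("WSL2", "cpu"): [6.5, 6.3, 5.6, 5.6, 5.6],
    ("Docker", "cuda"): [11.0, 10.7, 10.6, 10.9, 10.7],
    ("Docker", "cpu"): [5.8, 5.8, 5.7, 5.8, 6.1],
}


def summarize(times):
    """Return {env: (mean cuda time, mean cpu time, cuda/cpu ratio)}."""
    summary = {}
    for env in sorted({env for env, _ in times}):
        cuda = mean(times[(env, "cuda")])
        cpu = mean(times[(env, "cpu")])
        summary[env] = (cuda, cpu, cuda / cpu)
    return summary


if __name__ == "__main__":
    for env, (cuda, cpu, ratio) in summarize(EPOCH_TIMES).items():
        print(f"{env:8s} cuda {cuda:5.2f}s  cpu {cpu:5.2f}s  cuda/cpu {ratio:.2f}")
```

On these numbers, the CUDA run is roughly 3x faster than the CPU run on Windows, while in the other two environments it is about 1.7x to 1.8x slower than the CPU run.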
- lenet5.py
- lenet5-cpu.py
The only difference between the two scripts is line 11: in lenet5-cpu.py the device is set to the CPU. Can you spot the problem?

lenet5.py:

import os
import time
import torch
from torch import nn, optim

import sys
sys.path.append("..")
import d2lzh_pytorch as d2l

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print(torch.__version__)
print(device)


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, 5),  # in_channels, out_channels, kernel_size
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),  # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2)
        )
        self.fc = nn.Sequential(
            nn.Linear(16*4*4, 120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

    def forward(self, img):
        feature = self.conv(img)
        # Flatten the feature maps to (batch, 16*4*4) before the linear layers
        output = self.fc(feature.view(img.shape[0], -1))
        return output


net = LeNet()
print(net)

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)


def evaluate_accuracy(data_iter, net, device=None):
    if device is None and isinstance(net, torch.nn.Module):
        # Default to the device the model parameters live on
        device = list(net.parameters())[0].device
    acc_sum, n = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(net, torch.nn.Module):
                net.eval()  # evaluation mode
                acc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()
                net.train()  # back to training mode
            else:
                # net is a plain function rather than an nn.Module
                if('is_training' in net.__code__.co_varnames):
                    acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item()
                else:
                    acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
            n += y.shape[0]
    return acc_sum / n


def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):
    net = net.to(device)
    print("training on ", device)
    loss = torch.nn.CrossEntropyLoss()
    # Note: batch_count is never reset, so the printed loss is the epoch's
    # summed loss divided by the cumulative batch count over all epochs so far
    batch_count = 0
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            # .item() copies the value to the host on every batch
            train_l_sum += l.cpu().item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))


if __name__ == "__main__":
    lr, num_epochs = 0.001, 5
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)

lenet5-cpu.py:

import os
import time
import torch
from torch import nn, optim

import sys
sys.path.append("..")
import d2lzh_pytorch as d2l

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
device = torch.device('cpu')

print(torch.__version__)
print(device)


class LeNet(nn.Module):
    def __init__(self):
        super(LeNet, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 6, 5),  # in_channels, out_channels, kernel_size
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2),  # kernel_size, stride
            nn.Conv2d(6, 16, 5),
            nn.Sigmoid(),
            nn.MaxPool2d(2, 2)
        )
        self.fc = nn.Sequential(
            nn.Linear(16*4*4, 120),
            nn.Sigmoid(),
            nn.Linear(120, 84),
            nn.Sigmoid(),
            nn.Linear(84, 10)
        )

    def forward(self, img):
        feature = self.conv(img)
        # Flatten the feature maps to (batch, 16*4*4) before the linear layers
        output = self.fc(feature.view(img.shape[0], -1))
        return output


net = LeNet()
print(net)

batch_size = 256
train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size=batch_size)


def evaluate_accuracy(data_iter, net, device=None):
    if device is None and isinstance(net, torch.nn.Module):
        # Default to the device the model parameters live on
        device = list(net.parameters())[0].device
    acc_sum, n = 0.0, 0
    with torch.no_grad():
        for X, y in data_iter:
            if isinstance(net, torch.nn.Module):
                net.eval()  # evaluation mode
                acc_sum += (net(X.to(device)).argmax(dim=1) == y.to(device)).float().sum().cpu().item()
                net.train()  # back to training mode
            else:
                # net is a plain function rather than an nn.Module
                if('is_training' in net.__code__.co_varnames):
                    acc_sum += (net(X, is_training=False).argmax(dim=1) == y).float().sum().item()
                else:
                    acc_sum += (net(X).argmax(dim=1) == y).float().sum().item()
            n += y.shape[0]
    return acc_sum / n


def train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs):
    net = net.to(device)
    print("training on ", device)
    loss = torch.nn.CrossEntropyLoss()
    # Note: batch_count is never reset, so the printed loss is the epoch's
    # summed loss divided by the cumulative batch count over all epochs so far
    batch_count = 0
    for epoch in range(num_epochs):
        train_l_sum, train_acc_sum, n, start = 0.0, 0.0, 0, time.time()
        for X, y in train_iter:
            X = X.to(device)
            y = y.to(device)
            y_hat = net(X)
            l = loss(y_hat, y)
            optimizer.zero_grad()
            l.backward()
            optimizer.step()
            train_l_sum += l.cpu().item()
            train_acc_sum += (y_hat.argmax(dim=1) == y).sum().cpu().item()
            n += y.shape[0]
            batch_count += 1
        test_acc = evaluate_accuracy(test_iter, net)
        print('epoch %d, loss %.4f, train acc %.3f, test acc %.3f, time %.1f sec'
              % (epoch + 1, train_l_sum / batch_count, train_acc_sum / n, test_acc, time.time() - start))


if __name__ == "__main__":
    lr, num_epochs = 0.001, 5
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    train_ch5(net, train_iter, test_iter, batch_size, optimizer, device, num_epochs)
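
One way to narrow down where the epoch time goes would be to separate the time spent waiting for the next batch from the time spent in the training step itself. The sketch below is not part of the scripts above. `profile_loop`, `slow_batches`, and the `sync` parameter are hypothetical names, and the demo uses `time.sleep` as a stand-in for real data loading and training. With the scripts above, `batches` would presumably be `train_iter`, `step` a function wrapping the forward/backward/optimizer calls, and `sync` could be `torch.cuda.synchronize` so that queued GPU work is counted in the step time.

```python
import time


def profile_loop(batches, step, sync=None):
    """Split loop wall time into (seconds waiting for batches, seconds in step).

    sync: optional zero-argument callable invoked after each step before the
    clock is read (assumption: e.g. torch.cuda.synchronize for GPU runs).
    """
    load_s = step_s = 0.0
    it = iter(batches)
    while True:
        t0 = time.perf_counter()
        try:
            batch = next(it)  # time spent here is data loading
        except StopIteration:
            break
        t1 = time.perf_counter()
        step(batch)  # time spent here is compute
        if sync is not None:
            sync()
        t2 = time.perf_counter()
        load_s += t1 - t0
        step_s += t2 - t1
    return load_s, step_s


def slow_batches(n, delay):
    """Stand-in data source that takes `delay` seconds to produce each batch."""
    for i in range(n):
        time.sleep(delay)
        yield i


if __name__ == "__main__":
    load_s, step_s = profile_loop(slow_batches(5, 0.02), lambda b: time.sleep(0.005))
    print(f"loading {load_s:.3f}s, step {step_s:.3f}s")
```

If most of the time turns out to be in loading, the device the model runs on matters less than how the data iterator is configured in each environment.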