Inference time and TFLOPS in PyTorch on GPUs

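I am currently measuring half-precision inference time for different CNN models with torch.autograd.profiler on two different GPUs:

Nvidia RTX 2080 Ti (26.90 TFLOPS), run locally (with a better CPU)
Nvidia T4 (65.13 TFLOPS), run in the cloud

To my surprise, the 2080 Ti is considerably faster (half the time or less), regardless of batch size, input resolution, and architecture, even though it has less than half the TFLOPS.

Does anyone know why?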

import torch
import segmentation_models_pytorch as smp # pip install git+https://github.com/qubvel/segmentation_models.pytorch

runs = 10
res = 512
bs = 8
is_half = True

m = smp.Unet(encoder_name='resnet101', encoder_weights=None)
m.eval()
m.cuda()

t = torch.rand((bs, 3, res, res)).cuda()

# cast the model and the input to FP16 only when half precision is requested
if is_half:
    m.half()
    t = t.half()

# warm up
with torch.no_grad():
    m(t)

cpu_time_ms = 0
cuda_time_ms = 0
for i in range(runs):
    with torch.no_grad():
        with torch.autograd.profiler.profile(use_cuda=True) as prof:
            m(t)
        cpu_time_ms += prof.self_cpu_time_total / 1000
        cuda_time_ms += sum([evt.cuda_time_total for evt in prof.key_averages()]) / 1000

# average over all runs and over the batch, i.e. time per image
cpu_time_ms /= runs * bs
cuda_time_ms /= runs * bs

print('res={}x{} cuda={:.1f}ms cpu={:.1f}ms'.format(res, res, cuda_time_ms, cpu_time_ms))
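
One way to cross-check the profiler numbers is to time the forward pass with CUDA events instead. The following is only a minimal sketch, not part of the original script: `time_per_image_ms` and the small `toy` model are hypothetical stand-ins (the real test would pass the `smp.Unet` model and the FP16 input), and the CPU fallback exists only so the snippet also runs on a machine without a GPU.

```python
import time

import torch
import torch.nn as nn


def time_per_image_ms(model, batch, runs=10):
    """Average forward-pass time per image, in milliseconds (hypothetical helper)."""
    with torch.no_grad():
        model(batch)  # warm-up run, excluded from the measurement
        if batch.is_cuda:
            torch.cuda.synchronize()  # make sure the warm-up has finished
            start = torch.cuda.Event(enable_timing=True)
            end = torch.cuda.Event(enable_timing=True)
            start.record()
            for _ in range(runs):
                model(batch)
            end.record()
            torch.cuda.synchronize()  # wait until all queued kernels are done
            total_ms = start.elapsed_time(end)  # elapsed_time() returns milliseconds
        else:
            # CPU fallback (assumption: only here so the sketch runs without a GPU)
            t0 = time.perf_counter()
            for _ in range(runs):
                model(batch)
            total_ms = (time.perf_counter() - t0) * 1000
    return total_ms / (runs * batch.shape[0])


# Toy stand-in model and input, not the U-Net from the question.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
toy = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU()).to(device).eval()
x = torch.rand(2, 3, 32, 32, device=device)

ms = time_per_image_ms(toy, x, runs=3)
print('{:.3f} ms per image'.format(ms))
```

Unlike summing per-operator times from the profiler, this measures elapsed time on the GPU stream for the whole loop, so the two numbers need not match exactly.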

For example:

U-Net with ResNet-101 as backbone and batch size 8

T4
res=128x128 cuda=11.3ms cpu=3.0ms
res=256x256 cuda=14.5ms cpu=2.8ms
res=512x512 cuda=50.4ms cpu=7.3ms

RTX 2080 Ti
res=128x128 cuda=7.5ms cpu=1.7ms
res=256x256 cuda=8.6ms cpu=1.8ms
res=512x512 cuda=21.1ms cpu=3.0ms
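
To make the gap concrete, the short sketch below compares the ratio of the quoted TFLOPS figures with the measured CUDA-time ratios. All numbers are copied from this post; the variable names are only illustrative.

```python
# CUDA time per image in ms, copied from the results above.
t4_ms = {128: 11.3, 256: 14.5, 512: 50.4}
rtx_ms = {128: 7.5, 256: 8.6, 512: 21.1}

# Peak TFLOPS figures as quoted in the question.
tflops = {'T4': 65.13, 'RTX 2080 Ti': 26.90}

# How much faster the T4 would be if speed followed the quoted TFLOPS.
spec_ratio = tflops['T4'] / tflops['RTX 2080 Ti']

# How much faster the 2080 Ti actually is at each resolution.
speedups = {res: t4_ms[res] / rtx_ms[res] for res in t4_ms}

print('T4 / 2080 Ti quoted TFLOPS ratio: {:.2f}'.format(spec_ratio))
for res in sorted(speedups):
    print('res={0}x{0}: 2080 Ti is {1:.2f}x faster'.format(res, speedups[res]))
```

The quoted specifications favor the T4, while every measured ratio favors the 2080 Ti, and the advantage grows with input resolution.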

It would be great if you could provide the complete script so that others can run it too.
@Berriel I have added the complete code.