PyTorch: training YOLOv5 raises a CUDNN_STATUS_NOT_INITIALIZED error

I followed the instructions and changed nothing. I am using an AWS server with the Deep Learning AMI: Deep Learning AMI (Ubuntu 18.04) Version 40.0.

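I tried swapping my custom dataset for the COCO dataset, and also for a small subset of my custom dataset. The batch size does not seem to matter, and CUDA and the other drivers appear to work.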

The exception is raised as soon as training starts on a batch. Here is the full stack trace:

Logging results to runs/train/exp66
Starting training for 5 epochs...

     Epoch   gpu_mem       box       obj       cls     total   targets  img_size
  0%|                                                                                                                                                                                                                 | 0/22 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 533, in <module>
    train(hyp, opt, device, tb_writer, wandb)
  File "train.py", line 298, in train
    pred = model(imgs)  # forward
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/yolov5/models/yolo.py", line 121, in forward
    return self.forward_once(x, profile)  # single-scale inference, train
  File "/home/ubuntu/yolov5/models/yolo.py", line 137, in forward_once
    x = m(x)  # run
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/yolov5/models/common.py", line 113, in forward
    return self.conv(torch.cat([x[..., ::2, ::2], x[..., 1::2, ::2], x[..., ::2, 1::2], x[..., 1::2, 1::2]], 1))
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ubuntu/yolov5/models/common.py", line 38, in forward
    return self.act(self.bn(self.conv(x)))
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 399, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 395, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: cuDNN error: CUDNN_STATUS_NOT_INITIALIZED

I fixed it using conda: I cloned the pytorch environment that ships with the image, and it worked perfectly. I still don't know the cause, though.

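I don't know why, but it seems that torch 1.8 is built against an older version of CUDA. Also, since PyTorch ships its own CUDA, it doesn't seem to care which version is installed on your machine. Changing the torch version (and installing a compatible torchvision to match) solved my problem.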
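To see which CUDA and cuDNN versions the installed PyTorch build actually carries, a quick check along these lines may help. This is only a diagnostic sketch: the function name `torch_build_report` is made up for illustration, and it reports what the wheel was built with, not what is installed system-wide.

```python
def torch_build_report():
    """Collect version info about the installed PyTorch build, if any.

    Hypothetical helper for illustration; not part of YOLOv5 or PyTorch.
    """
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}
    return {
        "torch": torch.__version__,
        # CUDA version the wheel was built against (None for CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        # cuDNN version PyTorch can load (None if cuDNN is unavailable).
        "cudnn": torch.backends.cudnn.version(),
        # Whether a usable GPU and driver are visible right now.
        "cuda_available": torch.cuda.is_available(),
    }


if __name__ == "__main__":
    for key, value in torch_build_report().items():
        print(f"{key}: {value}")
```

Running this in both the failing environment and the working one should make any difference in the bundled CUDA or cuDNN version visible.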

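In my case, this is what I did:

  • Changed two lines in requirements.txt:

    torch==1.7.1
    torchvision==0.8.2

  • Created a new conda environment with python=3.8
  • Activated the environment
  • Installed the requirements from the modified file:

    $ pip install -r requirements.txt

Hope this helps someone :)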
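After reinstalling, one way to confirm that the new environment really picked up the pinned versions is sketched below. The helper `check_pins` and the `PINS` dictionary are hypothetical names; the version numbers are the ones from the steps above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins taken from the modified requirements.txt described above.
PINS = {"torch": "1.7.1", "torchvision": "0.8.2"}


def check_pins(pins):
    """Return {package: (expected, installed)} for every pin that does not match."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        # Wheels may carry a local build tag such as "1.7.1+cu110";
        # compare only the public part of the version string.
        public = installed.split("+")[0] if installed else None
        if public != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    problems = check_pins(PINS)
    if problems:
        for name, (expected, installed) in problems.items():
            print(f"{name}: expected {expected}, found {installed}")
    else:
        print("All pinned packages match.")
```

An empty result means both packages match the pins; it does not by itself prove that cuDNN initializes, so a short training run is still the real test.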