PyTorch with GPU: CUBLAS_STATUS_EXECUTION_FAILED error, how to trace or fix it?


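I am new to GPU training and PyTorch.

When I run my NMT code, I get a CUBLAS_STATUS_EXECUTION_FAILED error. (On the CPU, of course, it runs fine.) I know my GPU is not a powerful one, but the error occurs every time with a batch size of one.

Please help me fix or trace this problem.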
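One way to start tracing an error like this is to make CUDA kernel launches synchronous, so that the Python traceback points at the operation that actually failed instead of a later one. The sketch below sets `CUDA_LAUNCH_BLOCKING=1` and wraps a training step so that a failure reports the batch index and input shapes. The helper names `describe` and `run_step` are hypothetical and not part of the original NMT code; the wrapper only assumes that inputs expose `.shape` and `.dtype` the way tensors do.

```python
import os

# Assumption: this runs before the first CUDA call (safest: before `import torch`),
# because the variable is read when the CUDA context is created.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"


def describe(value):
    """Return a short description: type, shape and dtype for tensor-like values."""
    shape = getattr(value, "shape", None)
    if shape is None:
        return type(value).__name__
    dtype = getattr(value, "dtype", "?")
    return f"{type(value).__name__}(shape={tuple(shape)}, dtype={dtype})"


def run_step(step_fn, batch_index, **inputs):
    """Call step_fn(**inputs); on RuntimeError, add batch context and re-raise."""
    try:
        return step_fn(**inputs)
    except RuntimeError as err:
        details = ", ".join(f"{name}={describe(v)}" for name, v in inputs.items())
        raise RuntimeError(f"step failed at batch {batch_index}: {details}") from err
```

In a training loop this would be called as something like `run_step(train_step, i, src=src, tgt=tgt)`, where `train_step`, `src` and `tgt` stand in for whatever the real code uses. Synchronous launches slow training down, so the variable is best used only while debugging.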

Epoch [1/30]:   1%|▍
Epoch [1/30]:   1%|                 | 1/166 [00:00<?, ?it/s, loss=7.72, ppl=2.24e+3, |g_
Epoch [1/30]:   1%|                 | 1/166 [00:00<?, ?it/s, loss=7.72, ppl=2.24e+3, |g_
Epoch [1/30]:   1%|                 | 1/166 [00:00<?, ?it/s, loss=7.72, ppl=2.25e+3, |g_
Epoch [1/30]:   1%|         | 2/166 [00:00<00:22,  7.43it/s, loss=7.72, ppl=2.25e+3, |g_
Epoch [1/30]:   1%|         | 2/166 [00:00<00:22,  7.39it/s, loss=7.72, ppl=2.25e+3, |g_
Epoch [1/30]:   2%|▏        | 3/166 [00:00<00:11, 14.74it/s, loss=7.72, ppl=2.25e+3, |g_
Epoch [1/30]:   2%|▏        | 3/166 [00:00<00:11, 14.74it/s, loss=7.72, ppl=2.25e+3, |g_
Epoch [1/30]:   2%|▏        | 3/166 [00:00<00:11, 14.74it/s, loss=7.72, ppl=2.26e+3, |g_
Epoch [1/30]:   2%|▏        | 4/166 [00:00<00:10, 14.74it/s, loss=7.72, ppl=2.26e+3, |g_
Epoch [1/30]:   2%|▏        | 4/166 [00:00<00:10, 14.74it/s, loss=7.72, ppl=2.26e+3, |g_
Epoch [1/30]:   3%|▎        | 5/166 [00:00<00:13, 12.01it/s, loss=7.72, ppl=2.26e+3, |g_param|=4.27e+5, |param|=1.29e+3]

My GPU is:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P620         On   | 00000000:81:00.0 Off |                  N/A |
| 34%   42C    P8    N/A /  N/A |     19MiB /  1999MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2662      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      3026      G   /usr/bin/gnome-shell                4MiB |
+-----------------------------------------------------------------------------+

Searching for the keyword "CUBLAS_STATUS_EXECUTION_FAILED" turns up many suggestions to upgrade CUDA to 11.2. However, it is already 11.2.
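The `CUDA Version` field printed by `nvidia-smi` is the highest CUDA version the installed driver supports, which is not necessarily the version the PyTorch binary was built with. A minimal sketch for checking what PyTorch itself reports is shown below. The function name `cuda_report` is hypothetical; the import is guarded only so that the script still runs on a machine without PyTorch installed.

```python
def cuda_report():
    """Collect the CUDA details PyTorch itself sees, as a plain dict."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch": None}

    report = {
        "torch": torch.__version__,
        # CUDA version the PyTorch binary was built with (None on CPU-only builds).
        "built_with_cuda": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if report["cuda_available"]:
        report["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) compute capability of device 0.
        report["compute_capability"] = torch.cuda.get_device_capability(0)
        # GPU architectures this PyTorch build was compiled for.
        report["compiled_archs"] = torch.cuda.get_arch_list()
    return report


if __name__ == "__main__":
    for key, value in cuda_report().items():
        print(f"{key}: {value}")
```

If `built_with_cuda` differs from the toolkit version that was installed system-wide, upgrading or downgrading the system toolkit may have no effect on what PyTorch actually uses.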

GPU memory usage:

| 34%   41C    P8    N/A /  N/A |     19MiB /  1999MiB |      0%      Default |
| 34%   41C    P0    N/A /  N/A |     64MiB /  1999MiB |      2%      Default |
| 34%   42C    P0    N/A /  N/A |    202MiB /  1999MiB |      4%      Default |
| 34%   43C    P0    N/A /  N/A |    346MiB /  1999MiB |      3%      Default |
| 34%   43C    P0    N/A /  N/A |    498MiB /  1999MiB |      3%      Default |
| 34%   43C    P0    N/A /  N/A |    632MiB /  1999MiB |      3%      Default |
| 34%   43C    P0    N/A /  N/A |    850MiB /  1999MiB |      2%      Default |
| 34%   44C    P0    N/A /  N/A |    962MiB /  1999MiB |     47%      Default |
| 34%   44C    P0    N/A /  N/A |    962MiB /  1999MiB |     45%      Default |
| 34%   45C    P0    N/A /  N/A |     19MiB /  1999MiB |     19%      Default |
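As a quick check of the readings above, the memory column can be parsed to find the peak. This is only a sketch over the ten sampled lines; `nvidia-smi` polling can miss short-lived spikes, so a low peak here does not rule out a brief out-of-memory condition.

```python
import re

# The ten nvidia-smi samples from the question, copied verbatim.
samples = """\
| 34%   41C    P8    N/A /  N/A |     19MiB /  1999MiB |      0%      Default |
| 34%   41C    P0    N/A /  N/A |     64MiB /  1999MiB |      2%      Default |
| 34%   42C    P0    N/A /  N/A |    202MiB /  1999MiB |      4%      Default |
| 34%   43C    P0    N/A /  N/A |    346MiB /  1999MiB |      3%      Default |
| 34%   43C    P0    N/A /  N/A |    498MiB /  1999MiB |      3%      Default |
| 34%   43C    P0    N/A /  N/A |    632MiB /  1999MiB |      3%      Default |
| 34%   43C    P0    N/A /  N/A |    850MiB /  1999MiB |      2%      Default |
| 34%   44C    P0    N/A /  N/A |    962MiB /  1999MiB |     47%      Default |
| 34%   44C    P0    N/A /  N/A |    962MiB /  1999MiB |     45%      Default |
| 34%   45C    P0    N/A /  N/A |     19MiB /  1999MiB |     19%      Default |
"""

# Matches the "used / total" pair in the Memory-Usage column.
USAGE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")


def parse_usage(text):
    """Return a list of (used_mib, total_mib) tuples, one per sample line."""
    return [(int(used), int(total)) for used, total in USAGE.findall(text)]


readings = parse_usage(samples)
peak_used = max(used for used, _ in readings)
total = readings[0][1]
print(f"peak {peak_used} MiB of {total} MiB ({peak_used / total:.0%})")
# prints: peak 962 MiB of 1999 MiB (48%)
```

In these samples usage climbs to 962 MiB, just under half of the card's 1999 MiB, and then drops back to 19 MiB once the process exits.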

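I downgraded the CUDA toolkit to 11.1, but the result was the same. Eventually I found a comment on a similar issue, and it seems the main problem is the specification of my GPU (Pascal).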