PyTorch with GPU: CUBLAS_STATUS_EXECUTION_FAILED error, how to trace or fix it?
I'm new to GPU training and PyTorch. When I run NMT code, I get a CUBLAS_STATUS_EXECUTION_FAILED error. (On the CPU, of course, it runs fine.) I know my GPU is not very powerful, but the error occurs every time, with a batch size of 1. Please help me fix or trace this issue.
Epoch [1/30]: 3%|▎ | 5/166 [00:00<00:13, 12.01it/s, loss=7.72, ppl=2.26e+3, |g_param|=4.27e+5, |param|=1.29e+3]
My GPU is:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P620         On   | 00000000:81:00.0 Off |                  N/A |
| 34%   42C    P8    N/A /  N/A |     19MiB /  1999MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      2662      G   /usr/lib/xorg/Xorg                  9MiB |
|    0   N/A  N/A      3026      G   /usr/bin/gnome-shell                4MiB |
+-----------------------------------------------------------------------------+
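Searching for the keyword "CUBLAS_STATUS_EXECUTION_FAILED" turns up many suggestions to upgrade CUDA to version 11.2. However, mine is already 11.2.
GPU memory usage during the run: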
| 34% 41C P8 N/A / N/A | 19MiB / 1999MiB | 0% Default |
| 34% 41C P0 N/A / N/A | 64MiB / 1999MiB | 2% Default |
| 34% 42C P0 N/A / N/A | 202MiB / 1999MiB | 4% Default |
| 34% 43C P0 N/A / N/A | 346MiB / 1999MiB | 3% Default |
| 34% 43C P0 N/A / N/A | 498MiB / 1999MiB | 3% Default |
| 34% 43C P0 N/A / N/A | 632MiB / 1999MiB | 3% Default |
| 34% 43C P0 N/A / N/A | 850MiB / 1999MiB | 2% Default |
| 34% 44C P0 N/A / N/A | 962MiB / 1999MiB | 47% Default |
| 34% 44C P0 N/A / N/A | 962MiB / 1999MiB | 45% Default |
| 34% 45C P0 N/A / N/A | 19MiB / 1999MiB | 19% Default |
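To read the samples above at a glance, the rows can be parsed and the peak computed. This is only a minimal sketch: the regular expression and the `memory_trace` helper are illustrative names, not part of any tool, and the rows are copied from the nvidia-smi output above.

```python
import re

# Memory-Usage column as it appears in nvidia-smi rows, e.g. "962MiB / 1999MiB".
MEM_RE = re.compile(r"(\d+)MiB\s*/\s*(\d+)MiB")

# The ten rows sampled during the failing run (copied from the question).
SAMPLES = """\
| 34% 41C P8 N/A / N/A | 19MiB / 1999MiB | 0% Default |
| 34% 41C P0 N/A / N/A | 64MiB / 1999MiB | 2% Default |
| 34% 42C P0 N/A / N/A | 202MiB / 1999MiB | 4% Default |
| 34% 43C P0 N/A / N/A | 346MiB / 1999MiB | 3% Default |
| 34% 43C P0 N/A / N/A | 498MiB / 1999MiB | 3% Default |
| 34% 43C P0 N/A / N/A | 632MiB / 1999MiB | 3% Default |
| 34% 43C P0 N/A / N/A | 850MiB / 1999MiB | 2% Default |
| 34% 44C P0 N/A / N/A | 962MiB / 1999MiB | 47% Default |
| 34% 44C P0 N/A / N/A | 962MiB / 1999MiB | 45% Default |
| 34% 45C P0 N/A / N/A | 19MiB / 1999MiB | 19% Default |
"""


def memory_trace(text):
    """Return a (used_mib, total_mib) pair for every nvidia-smi row in text."""
    return [(int(used), int(total)) for used, total in MEM_RE.findall(text)]


trace = memory_trace(SAMPLES)
peak_used, total = max(trace)  # tuples compare on used_mib first
print(f"peak: {peak_used} MiB of {total} MiB ({peak_used / total:.0%})")
# prints: peak: 962 MiB of 1999 MiB (48%)
```

Usage climbs to 962 MiB, under half of the card's 1999 MiB, and then falls back to 19 MiB once the process exits. The sampling is coarse, so a short allocation spike between samples could be missed, and this alone does not rule out a memory problem.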
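…I downgraded the CUDA toolkit to 11.1, but the result was the same. Eventually I found a comment about a similar issue, and in the end it seems the main problem is the specification of my GPU (Pascal).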
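One way to trace this kind of failure is to collect the relevant facts in one place: which CUDA version the PyTorch binary was built for, which architectures it was compiled for, what the device reports, and whether a minimal matrix multiply (a cuBLAS call) succeeds on its own. The sketch below is an assumption-laden helper, not a fix: `gpu_report` is a hypothetical name, and it only gathers information rather than confirming the Pascal diagnosis. Setting `CUDA_LAUNCH_BLOCKING=1` makes kernel launches synchronous, so the Python traceback tends to point at the call that actually failed.

```python
import importlib.util
import os

# Must be set before CUDA is initialised: kernel launches become synchronous,
# so errors surface at the failing call instead of at a later, unrelated one.
os.environ.setdefault("CUDA_LAUNCH_BLOCKING", "1")


def gpu_report():
    """Collect facts useful for debugging a cuBLAS failure (hypothetical helper).

    Returns early with a partial report when PyTorch or CUDA is unavailable.
    """
    report = {"torch_installed": importlib.util.find_spec("torch") is not None}
    if not report["torch_installed"]:
        return report

    import torch

    report["torch_version"] = torch.__version__
    report["built_for_cuda"] = torch.version.cuda  # None on CPU-only builds
    report["cuda_available"] = torch.cuda.is_available()
    if not report["cuda_available"]:
        return report

    major, minor = torch.cuda.get_device_capability(0)
    report["device_name"] = torch.cuda.get_device_name(0)
    report["device_arch"] = f"sm_{major}{minor}"
    # Architectures this PyTorch binary was compiled for.
    report["compiled_archs"] = torch.cuda.get_arch_list()

    # Smallest useful cuBLAS exercise: one float32 matrix multiply.
    try:
        a = torch.randn(64, 64, device="cuda")
        (a @ a).sum().item()  # .item() forces a device synchronisation
        report["matmul_ok"] = True
    except RuntimeError as exc:
        report["matmul_ok"] = False
        report["matmul_error"] = str(exc)
    return report


if __name__ == "__main__":
    for key, value in gpu_report().items():
        print(f"{key}: {value}")
```

If `device_arch` does not appear in `compiled_archs`, or the isolated matrix multiply already fails, the problem likely lies in the PyTorch build and GPU combination rather than in the NMT code itself.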