PyTorch: how do I solve the infamous "unhandled cuda error, NCCL version 2.7.8" error?

I have seen multiple posts about:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
but none of them seem to solve it for me. One common suggestion is:

torch.cuda.set_device(device)

That does not seem to work for me. I have tried different GPUs. I have tried downgrading the PyTorch and CUDA versions: different combinations of 1.6.0, 1.7.1, 1.8.0 with CUDA 10.2, 11.0, 11.1. I don't know what else to do. What have people done to solve this error?
Possibly relevant.
More complete error message:
('jobid', 4852)
('slurm_jobid', -1)
('slurm_array_task_id', -1)
('condor_jobid', 4852)
('current_time', 'Mar25_16-27-35')
('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb'))
('gpu_name', 'GeForce GTX TITAN X')
('PID', '30688')
torch.cuda.device_count()=2
opts.world_size=2
ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
rank=0
mp.current_process()=<SpawnProcess name='SpawnProcess-1' parent=30688 started>
os.getpid()=30704
setting up rank=0 (with world_size=2)
MASTER_ADDR='127.0.0.1'
59264
backend='nccl'
--> done setting up rank=0
setup process done for rank=0
Traceback (most recent call last):
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
main_distributed()
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 212, in train
tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
This is not a very satisfying answer, but it is what ended up working for me. I simply used PyTorch 1.7.1, whose CUDA version is 10.2. As long as CUDA 11.0 is loaded it seems to work. To install that version, do:
conda install -y pytorch==1.7.1 torchvision torchaudio cudatoolkit=10.2 -c pytorch -c conda-forge
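After installing, a quick sanity check (a sketch, assuming the environment above was activated) confirms which PyTorch/CUDA/NCCL builds you actually ended up with:

```python
import torch

# Version of PyTorch and the CUDA toolkit it was built against
print(torch.__version__)        # expect 1.7.1 for the install command above
print(torch.version.cuda)       # expect 10.2
# Whether the driver/runtime can actually see a GPU right now
print(torch.cuda.is_available())
# NCCL version bundled with this PyTorch build
print(torch.cuda.nccl.version())
```

If torch.version.cuda disagrees with the toolkit your cluster module loaded, that mismatch is a likely source of the ncclUnhandledCudaError.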
If you are on an HPC cluster, run module avail to make sure the correct CUDA version is loaded. You may also need to source bash and other setup files for your submission job to work. My setup looks as follows:
#!/bin/bash
echo JOB STARTED
# a submission job is usually empty and has the root of the submission so you probably need your HOME env var
export HOME=/home/miranda9
# to have modules work and the conda command work
source /etc/bashrc
source /etc/profile
source /etc/profile.d/modules.sh
source ~/.bashrc
source ~/.bash_profile
conda activate metalearningpy1.7.1c10.2
#conda activate metalearning1.7.1c11.1
#conda activate metalearning11.1
#module load cuda-toolkit/10.2
module load cuda-toolkit/11.1
#nvidia-smi
nvcc --version
#conda list
hostname
echo $PATH
which python
# - run script
python -u ~/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py
I also echo other useful things, such as the nvcc version, to make sure loading is working (note that nvidia-smi does not show the correct CUDA version at its top).
Note that I suspect this might just be a bug, since CUDA 11.1 + PyTorch 1.8.1 were new at the time of writing. I did try
torch.cuda.set_device(opts.gpu) # https://github.com/pytorch/pytorch/issues/54550
but I can't say that it always works, or why it doesn't when it fails. I do have it in my current code, but I believe I still got the error with PyTorch 1.8.x + CUDA 11.x.
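For concreteness, here is a minimal sketch of where that call sits in a spawned worker. This is illustrative, not my exact code: the function name setup_process and the port value are made up, and the key assumption (from pytorch/pytorch#54550) is that torch.cuda.set_device runs before dist.init_process_group:

```python
import os

import torch
import torch.distributed as dist

def setup_process(rank: int, world_size: int, backend: str = 'nccl'):
    """Illustrative sketch: bind this worker to its GPU *before*
    initializing the NCCL process group, so NCCL does not create a
    CUDA context on the wrong device."""
    os.environ['MASTER_ADDR'] = '127.0.0.1'
    os.environ['MASTER_PORT'] = '59264'  # any free port
    # The workaround: select the device first ...
    torch.cuda.set_device(rank)
    # ... then initialize the process group.
    dist.init_process_group(backend, rank=rank, world_size=world_size)
```

Each worker spawned by mp.spawn would call setup_process(rank, world_size) as its first step; this sketch requires a machine with the corresponding GPUs to actually run.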
See my conda list in case it helps:
$ conda list
# packages in environment at /home/miranda9/miniconda3/envs/metalearningpy1.7.1c10.2:
#
# Name Version Build Channel
_libgcc_mutex 0.1 main
absl-py 0.12.0 py38h06a4308_0
aioconsole 0.3.1 pypi_0 pypi
aiohttp 3.7.4 py38h27cfd23_1
anatome 0.0.1 pypi_0 pypi
argcomplete 1.12.2 pypi_0 pypi
astunparse 1.6.3 pypi_0 pypi
async-timeout 3.0.1 py38h06a4308_0
attrs 20.3.0 pyhd3eb1b0_0
beautifulsoup4 4.9.3 pyha847dfd_0
blas 1.0 mkl
blinker 1.4 py38h06a4308_0
boto 2.49.0 pypi_0 pypi
brotlipy 0.7.0 py38h27cfd23_1003
bzip2 1.0.8 h7b6447c_0
c-ares 1.17.1 h27cfd23_0
ca-certificates 2021.1.19 h06a4308_1
cachetools 4.2.1 pyhd3eb1b0_0
cairo 1.14.12 h8948797_3
certifi 2020.12.5 py38h06a4308_0
cffi 1.14.0 py38h2e261b9_0
chardet 3.0.4 py38h06a4308_1003
click 7.1.2 pyhd3eb1b0_0
cloudpickle 1.6.0 pypi_0 pypi
conda 4.9.2 py38h06a4308_0
conda-build 3.21.4 py38h06a4308_0
conda-package-handling 1.7.2 py38h03888b9_0
coverage 5.5 py38h27cfd23_2
crcmod 1.7 pypi_0 pypi
cryptography 3.4.7 py38hd23ed53_0
cudatoolkit 10.2.89 hfd86e86_1
cycler 0.10.0 py38_0
cython 0.29.22 py38h2531618_0
dbus 1.13.18 hb2f20db_0
decorator 5.0.3 pyhd3eb1b0_0
dgl-cuda10.2 0.6.0post1 py38_0 dglteam
dill 0.3.3 pyhd3eb1b0_0
expat 2.3.0 h2531618_2
fasteners 0.16 pypi_0 pypi
filelock 3.0.12 pyhd3eb1b0_1
flatbuffers 1.12 pypi_0 pypi
fontconfig 2.13.1 h6c09931_0
freetype 2.10.4 h7ca028e_0 conda-forge
fribidi 1.0.10 h7b6447c_0
future 0.18.2 pypi_0 pypi
gast 0.3.3 pypi_0 pypi
gcs-oauth2-boto-plugin 2.7 pypi_0 pypi
glib 2.63.1 h5a9c865_0
glob2 0.7 pyhd3eb1b0_0
google-apitools 0.5.31 pypi_0 pypi
google-auth 1.28.0 pyhd3eb1b0_0
google-auth-oauthlib 0.4.3 pyhd3eb1b0_0
google-pasta 0.2.0 pypi_0 pypi
google-reauth 0.1.1 pypi_0 pypi
graphite2 1.3.14 h23475e2_0
graphviz 2.40.1 h21bd128_2
grpcio 1.32.0 pypi_0 pypi
gst-plugins-base 1.14.0 hbbd80ab_1
gstreamer 1.14.0 hb453b48_1
gsutil 4.60 pypi_0 pypi
gym 0.18.0 pypi_0 pypi
h5py 2.10.0 pypi_0 pypi
harfbuzz 1.8.8 hffaf4a1_0
higher 0.2.1 pypi_0 pypi
httplib2 0.19.0 pypi_0 pypi
icu 58.2 he6710b0_3
idna 2.10 pyhd3eb1b0_0
importlib-metadata 3.7.3 py38h06a4308_1
intel-openmp 2020.2 254
jinja2 2.11.3 pyhd3eb1b0_0
joblib 1.0.1 pyhd3eb1b0_0
jpeg 9b h024ee3a_2
keras-preprocessing 1.1.2 pypi_0 pypi
kiwisolver 1.3.1 py38h2531618_0
lark-parser 0.6.5 pypi_0 pypi
lcms2 2.11 h396b838_0
ld_impl_linux-64 2.33.1 h53a641e_7
learn2learn 0.1.5 pypi_0 pypi
libarchive 3.4.2 h62408e4_0
libffi 3.2.1 hf484d3e_1007
libgcc-ng 9.1.0 hdf63c60_0
libgfortran-ng 7.3.0 hdf63c60_0
liblief 0.10.1 he6710b0_0
libpng 1.6.37 h21135ba_2 conda-forge
libprotobuf 3.14.0 h8c45485_0
libstdcxx-ng 9.1.0 hdf63c60_0
libtiff 4.1.0 h2733197_1
libuuid 1.0.3 h1bed415_2
libuv 1.40.0 h7b6447c_0
libxcb 1.14 h7b6447c_0
libxml2 2.9.10 hb55368b_3
lmdb 0.94 pypi_0 pypi
lz4-c 1.9.2 he1b5a44_3 conda-forge
markdown 3.3.4 py38h06a4308_0
markupsafe 1.1.1 py38h7b6447c_0
matplotlib 3.3.4 py38h06a4308_0
matplotlib-base 3.3.4 py38h62a2d02_0
memory-profiler 0.58.0 pypi_0 pypi
mkl 2020.2 256
mkl-service 2.3.0 py38h1e0a361_2 conda-forge
mkl_fft 1.3.0 py38h54f3939_0
mkl_random 1.2.0 py38hc5bc63f_1 conda-forge
mock 2.0.0 pypi_0 pypi
monotonic 1.5 pypi_0 pypi
multidict 5.1.0 py38h27cfd23_2
ncurses 6.2 he6710b0_1
networkx 2.5 py_0
ninja 1.10.2 py38hff7bd54_0
numpy 1.19.2 py38h54aff64_0
numpy-base 1.19.2 py38hfa32c7d_0
oauth2client 4.1.3 pypi_0 pypi
oauthlib 3.1.0 py_0
olefile 0.46 pyh9f0ad1d_1 conda-forge
openssl 1.1.1k h27cfd23_0
opt-einsum 3.3.0 pypi_0 pypi
ordered-set 4.0.2 pypi_0 pypi
pandas 1.2.3 py38ha9443f7_0
pango 1.42.4 h049681c_0
patchelf 0.12 h2531618_1
pbr 5.5.1 pypi_0 pypi
pcre 8.44 he6710b0_0
pexpect 4.6.0 pypi_0 pypi
pillow 7.2.0 pypi_0 pypi
pip 21.0.1 py38h06a4308_0
pixman 0.40.0 h7b6447c_0
pkginfo 1.7.0 py38h06a4308_0
progressbar2 3.39.3 pypi_0 pypi
protobuf 3.14.0 py38h2531618_1
psutil 5.8.0 py38h27cfd23_1
ptyprocess 0.7.0 pypi_0 pypi
py-lief 0.10.1 py38h403a769_0
pyasn1 0.4.8 py_0
pyasn1-modules 0.2.8 py_0
pycapnp 1.0.0 pypi_0 pypi
pycosat 0.6.3 py38h7b6447c_1
pycparser 2.20 py_2
pyglet 1.5.0 pypi_0 pypi
pyjwt 1.7.1 py38_0
pyopenssl 20.0.1 pyhd3eb1b0_1
pyparsing 2.4.7 pyhd3eb1b0_0
pyqt 5.9.2 py38h05f1152_4
pysocks 1.7.1 py38h06a4308_0
python 3.8.2 hcf32534_0
python-dateutil 2.8.1 pyhd3eb1b0_0
python-libarchive-c 2.9 pyhd3eb1b0_0
python-utils 2.5.6 pypi_0 pypi
python_abi 3.8 1_cp38 conda-forge
pytorch 1.7.1 py3.8_cuda10.2.89_cudnn7.6.5_0 pytorch
pytz 2021.1 pyhd3eb1b0_0
pyu2f 0.1.5 pypi_0 pypi
pyyaml 5.4.1 py38h27cfd23_1
qt 5.9.7 h5867ecd_1
readline 8.1 h27cfd23_0
requests 2.25.1 pyhd3eb1b0_0
requests-oauthlib 1.3.0 py_0
retry-decorator 1.1.1 pypi_0 pypi
ripgrep 12.1.1 0
rsa 4.7.2 pyhd3eb1b0_1
ruamel_yaml 0.15.100 py38h27cfd23_0
scikit-learn 0.24.1 py38ha9443f7_0
scipy 1.6.2 py38h91f5cce_0
setuptools 52.0.0 py38h06a4308_0
sexpdata 0.0.3 pypi_0 pypi
sip 4.19.13 py38he6710b0_0
six 1.15.0 pyh9f0ad1d_0 conda-forge
soupsieve 2.2.1 pyhd3eb1b0_0
sqlite 3.35.2 hdfb4753_0
tensorboard 2.4.0 pyhc547734_0
tensorboard-plugin-wit 1.6.0 py_0
tensorflow 2.4.1 pypi_0 pypi
tensorflow-estimator 2.4.0 pypi_0 pypi
termcolor 1.1.0 pypi_0 pypi
threadpoolctl 2.1.0 pyh5ca1d4c_0
tk 8.6.10 hbc83047_0
torchaudio 0.7.2 py38 pytorch
torchmeta 1.7.0 pypi_0 pypi
torchtext 0.8.1 py38 pytorch
torchvision 0.8.2 py38_cu102 pytorch
tornado 6.1 py38h27cfd23_0
tqdm 4.56.0 pypi_0 pypi
typing-extensions 3.7.4.3 0
typing_extensions 3.7.4.3 py_0 conda-forge
urllib3 1.26.4 pyhd3eb1b0_0
werkzeug 1.0.1 pyhd3eb1b0_0
wheel 0.36.2 pyhd3eb1b0_0
wrapt 1.12.1 pypi_0 pypi
xz 5.2.5 h7b6447c_0
yaml 0.2.5 h7b6447c_0
yarl 1.6.3 py38h27cfd23_0
zipp 3.4.1 pyhd3eb1b0_0
zlib 1.2.11 h7b6447c_3
zstd 1.4.5 h9ceee32_0
Since you have enough rep to view deleted answers, I assume you already tried export NCCL_IB_DISABLE=1?
@jodag I just spun up the conda env I made to reproduce this bug. Yes, it unfortunately still produces the bug, even with export NCCL_IB_DISABLE=1. One of the deleted answers also hinted at something about export NCCL_SOCKET_IFNAME=, but I don't know what that means or how to get the value. If you know, please tell me so I can try it too, and hopefully close these questions with an answer and the exact scenario in which it worked (so people know when it applies to them). I believe that running torch.cuda.set_device(opts.gpu) # https://github.com/pytorch/pytorch/issues/54550
works if you run it before dist.init_process_group(backend, rank=rank, world_size=world_size). It has worked well so far, but I can't guarantee it runs 100% of the time; my minimal example does seem to work.
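For what it's worth, NCCL_SOCKET_IFNAME is meant to name the network interface NCCL should use for its sockets (e.g. eth0). One way to see which interface names exist on a Linux box, using only the Python standard library, is:

```python
import socket

# List (index, name) pairs for every network interface on this machine.
# The names (e.g. 'lo', 'eth0', 'ib0') are the candidate values for
# NCCL_SOCKET_IFNAME; 'lo' is the loopback and usually not what you want.
for index, name in socket.if_nameindex():
    print(index, name)
```

Whether setting the variable actually fixes this particular error is something I cannot confirm; it is just what the deleted answer suggested trying.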