Slurm and Open MPI: An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun
I have installed Open MPI and Slurm on two nodes, and I want to use Slurm to run MPI jobs. When I use srun to run non-MPI jobs, everything works. However, when I use salloc to run an MPI job, I get an error. The environment and setup are as follows.
Env:
slurm 17.02.1-2
mpirun (Open MPI) 2.1.0
test.sh:
#!/bin/bash
MACHINEFILE="nodes.$SLURM_JOB_ID"
# Generate Machinefile for mpich such that hosts are in the same
# order as if run via srun
#
srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE
source /home/slurm/allreduce/tf/tf-allreduce/bin/activate
mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE test
rm $MACHINEFILE
Command:
salloc -N2 -n2 bash test.sh
Error:
salloc: Granted job allocation 97
--------------------------------------------------------------------------
An ORTE daemon has unexpectedly failed after launch and before
communicating back to mpirun. This could be caused by a number
of factors, including an inability to create a connection back
to mpirun due to a lack of common network interfaces and/or no
route found between them. Please check network connectivity
(including firewalls and network routing requirements).
--------------------------------------------------------------------------
salloc: Relinquishing job allocation 97
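Can anyone help? Thanks.

For debugging purposes, could you insert
srun -N $SLURM_NNODES -n $SLURM_NNODES $(which orted)
into your script?
As a side note, you do not need the -np or -machinefile options. You can simply run mpirun ./test (if you just use mpirun test, you might inadvertently run /usr/bin/test).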
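
To illustrate the side note above, here is a minimal sketch of a slimmed-down test.sh. It assumes that Open MPI was built with Slurm support, so that mpirun can take the node list and task count from the allocation on its own; if that is not the case on your cluster, the -np and -machinefile options from the original script would still be needed. The helper resolve_in_path and the variable PROGRAM are hypothetical names introduced only for this example.

```shell
#!/bin/bash
# Sketch only: a reduced test.sh without -np or -machinefile.
# Assumption: Open MPI was built with Slurm support.

# Launch the program with an explicit ./ prefix so that no PATH lookup
# can resolve a bare "test" to a system binary such as /usr/bin/test.
PROGRAM="./test"

# resolve_in_path NAME: print the first executable file called NAME that a
# PATH search would find; return non-zero if there is none.
resolve_in_path() {
    old_ifs=$IFS
    IFS=:
    for dir in $PATH; do
        [ -n "$dir" ] || dir=.          # an empty PATH entry means the cwd
        if [ -x "$dir/$1" ] && [ ! -d "$dir/$1" ]; then
            IFS=$old_ifs
            echo "$dir/$1"
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Show what a bare "test" would resolve to, for comparison with $PROGRAM.
echo "bare 'test' resolves to: $(resolve_in_path test || echo '(not found in PATH)')"
echo "will launch: $PROGRAM"

# Only attempt the launch inside a Slurm allocation with mpirun available.
if [ -n "${SLURM_JOB_ID:-}" ] && command -v mpirun >/dev/null 2>&1; then
    mpirun "$PROGRAM"
else
    echo "not inside a Slurm allocation (or mpirun is missing); skipping launch"
fi
```

Run it the same way as before, for example with salloc -N2 -n2 bash test.sh. Outside an allocation the script only prints the PATH comparison and skips the launch.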