Slurm and Open MPI: An ORTE daemon has unexpectedly failed after launch and before communicating back to mpirun

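I installed Open MPI and Slurm on two nodes, and I want to run MPI jobs through Slurm. When I run a non-MPI job with `srun`, everything works. However, when I run an MPI job with `salloc`, I get an error. The environment and details are as follows.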

Env:

  • slurm 17.02.1-2
  • mpirun (Open MPI) 2.1.0
  • test.sh

    #!/bin/bash
    
    MACHINEFILE="nodes.$SLURM_JOB_ID"
    
    # Generate Machinefile for mpich such that hosts are in the same
    #  order as if run via srun
    #
    srun -l /bin/hostname | sort -n | awk '{print $2}' > $MACHINEFILE
    
    source /home/slurm/allreduce/tf/tf-allreduce/bin/activate
    
    mpirun -np $SLURM_NTASKS -machinefile $MACHINEFILE test
    
    rm $MACHINEFILE
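
For readers unfamiliar with the machinefile line in the script above, here is a minimal sketch of what that pipeline does. It assumes `srun -l` prefixes each output line with the task id followed by `: `; the sample text and the host names `node01` and `node02` are made up for illustration and are not taken from the asker's cluster.

```shell
#!/bin/bash
# Hypothetical `srun -l /bin/hostname` output: "<task id>: <hostname>",
# possibly arriving out of order. node01/node02 are made-up host names.
sample_output='1: node02
0: node01'

# Same pipeline as in test.sh, fed with the sample text instead of srun:
# sort numerically by task id, then keep only the hostname column.
machinefile_lines=$(printf '%s\n' "$sample_output" | sort -n | awk '{print $2}')

printf '%s\n' "$machinefile_lines"
```

The result is one hostname per line, ordered by task id, which is the layout the `-machinefile` option reads.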
    
Command:

    salloc -N2 -n2 bash test.sh
    
Error:

    salloc: Granted job allocation 97
    --------------------------------------------------------------------------
    An ORTE daemon has unexpectedly failed after launch and before
    communicating back to mpirun. This could be caused by a number
    of factors, including an inability to create a connection back
    to mpirun due to a lack of common network interfaces and/or no
    route found between them. Please check network connectivity
    (including firewalls and network routing requirements).
    --------------------------------------------------------------------------
    salloc: Relinquishing job allocation 97
    

Can anyone help? Thanks.

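For debugging purposes, could you insert `srun -N $SLURM_NNODES -N $SLURM_NNODES $(which orted)` into your script? As a side note, you do not need the `-np` and `-machinefile` options. You can simply run `mpirun ./test` (if you simply use `mpirun test`, you might unintentionally end up running `/usr/bin/test`).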
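
To check the `mpirun test` versus `mpirun ./test` point on your own machine, here is a rough sketch. It assumes the launcher resolves a bare program name through the directories in `PATH`, which is what the comment above implies. `resolve_on_path` is a hypothetical helper written for this illustration, not part of Slurm or Open MPI.

```shell
#!/bin/bash
# Sketch: show which file a bare program name resolves to via PATH.

# Print the first executable file named "$1" found in the PATH directories.
# Shell builtins are ignored on purpose: the assumption here is that the
# launcher performs a plain PATH search rather than asking the shell.
resolve_on_path() {
    local name=$1 dir
    local IFS=:
    for dir in $PATH; do
        # An empty PATH entry conventionally means the current directory.
        [ -z "$dir" ] && dir=.
        if [ -f "$dir/$name" ] && [ -x "$dir/$name" ]; then
            printf '%s\n' "$dir/$name"
            return 0
        fi
    done
    return 1
}

# Where would a bare "test" come from on this machine?
resolve_on_path test || echo "no 'test' executable found on PATH"
```

If this prints a system path rather than your own binary, then writing `./test` (or an absolute path) is the way to be sure the MPI program in the current directory is the one that gets launched.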