
Setting BTL flags in Open MPI

Tags: amazon-web-services, mpi, cluster-computing, openmpi


I am trying to run a simple hello world test against my Open MPI installation. I have set up a two-node cluster on Amazon AWS, running SUSE SLES 11 SP3 with Open MPI 1.4.4 (somewhat old, but my Linux distribution has no newer binaries). I have reached the last step and am having trouble setting the btl flags correctly.

Here is what I can do:

  • I can scp in both directions between the nodes, so passwordless SSH is working correctly

  • If I run iptables -L, it shows no firewall rules, so I believe communication between the nodes should work

  • I can compile my helloworld.c program with mpicc, and I have confirmed that it runs correctly on another working cluster, so I believe my local paths are set correctly and the program definitely works (a minimal sketch of such a program appears after this list)

  • If I execute mpirun from the master node, using only the master node, helloworld executes correctly:

    ip-xxx-xxx-xxx-133: # mpirun -n 1 -host master --mca btl sm,openib,self ./helloworldmpi
    ip-xxx-xxx-xxx-133: hello world from process 0 of 1
    
  • If I execute mpirun from the master node, using only the worker node, helloworld also executes correctly:

    ip-xxx-xxx-xxx-133: # mpirun -n 1 -host node001 --mca btl sm,openib,self ./helloworldmpi
    ip-xxx-xxx-xxx-210: hello world from process 0 of 1
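
The helloworld.c source is not included in the question; a minimal MPI program consistent with the output above would look roughly like this (a sketch, not the poster's actual code; the file and binary names are taken from the commands above):

    /* helloworld.c -- sketch of the unshown test program.
       Compile with: mpicc helloworld.c -o helloworldmpi */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size;

        /* Start the MPI runtime, then query this process's rank and the
           total number of ranks in MPI_COMM_WORLD. */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("hello world from process %d of %d\n", rank, size);

        MPI_Finalize();
        return 0;
    }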
    
Now, my problem is that if I try to run helloworld on both nodes, I get an error:

ip-xxx-xxx-xxx-133: # mpirun -n 2 -host master,node001 --mca btl openib,self ./helloworldmpi
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[5228,1],0]) is on host: ip-xxx-xxx-xxx-133
  Process 2 ([[5228,1],1]) is on host: node001
  BTLs attempted: self

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  PML add procs failed
  --> Returned "Unreachable" (-12) instead of "Success" (0)
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[ip-xxx-xxx-xxx-133:7037] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
--------------------------------------------------------------------------
mpirun has exited due to process rank 0 with PID 7037 on
node ip-xxx-xxx-xxx-133 exiting without calling "finalize". This may
have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------
*** The MPI_Init() function was called before MPI_INIT was invoked.
*** This is disallowed by the MPI standard.
*** Your MPI job will now abort.
[ip-xxx-xxx-xxx-210:5838] Abort before MPI_INIT completed successfully; not able to guarantee that all other processes were killed!
[ip-xxx-xxx-xxx-133:07032] 1 more process has sent help message help-mca-bml-r2.txt / unreachable proc
[ip-xxx-xxx-xxx-133:07032] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[ip-xxx-xxx-xxx-133:07032] 1 more process has sent help message help-mpi-runtime / mpi_init:startup:internal-failure

Finally, if I omit the --mca btl sm,openib,self flags entirely, nothing works at all. I admit my understanding of these flags is nearly zero, and there is very little information on the web about their use. I looked at the data.conf file and am not sure all the devices listed there actually exist, but the --mca flags seem to solve most of the problem, since I can at least execute on each node of the cluster individually. Any pointers on what I might be doing wrong, or where I might look next, would be greatly appreciated.

For the record, I simply added tcp to the --mca btl flags, and it now works correctly.
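
The exact command is not shown in the update, but with the hostnames above it would be along these lines:

    ip-xxx-xxx-xxx-133: # mpirun -n 2 -host master,node001 --mca btl tcp,sm,self ./helloworldmpi

which should print one "hello world" line per rank, one from each node.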

"--mca btl openib,sm,self" tells Open MPI which transports to use for MPI traffic. You specified:

  • openib: InfiniBand or iWARP
  • sm: shared memory
  • self: loopback
As far as I know (though I have not been following AWS closely), AWS has no InfiniBand or iWARP, so specifying openib here is useless. If you add "tcp" to the comma-delimited list, it should use TCP, which is probably what you want. Specifically: "--mca btl tcp,sm,self" (the order of the comma-delimited list does not matter).
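
As a quick sanity check (not part of the original answer), you can list the BTL components your Open MPI build actually provides with ompi_info:

    ompi_info | grep "MCA btl"

Each available transport (tcp, sm, self, openib, ...) appears as its own "MCA btl:" line; if openib is not listed there, it can never be selected.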


That said, Open MPI should effectively pick sm, tcp, and self by default, so you should not need to specify "--mca btl tcp,sm,self" at all. I find it a bit odd that this did not work for you.
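
In other words, a plain invocation with no --mca flag at all, e.g.

    ip-xxx-xxx-xxx-133: # mpirun -n 2 -host master,node001 ./helloworldmpi

should in principle pick working transports automatically (this command is shown for illustration; it does not appear in the original exchange).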

Thanks. This nicely explains the conclusion I reached after working on it for several hours yesterday: Amazon does not use InfiniBand. Regarding your last comment, I am not sure either, but I think it may be caused by my data.conf file. I believe that file lists some hardware that does not actually exist (I borrowed the file from another Linux AMI on Amazon). Using the btl flags somehow filters the problematic lines of data.conf out of MPI's view. If I do not use the btl flags, mpirun complains about missing cma and so on (I cannot recall the exact error right now) and aborts.