C++: Fatal error in MPI_Allreduce
I need to create a cluster using MPICH. I first tried the examples on a single machine, and they all worked as expected. Then I set up the cluster as described here and ran the example below, which works fine:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char** argv) {
    int myrank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);  /* total number of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* rank of this process */

    printf("Hello from processor %d of %d\n", myrank, nprocs);

    MPI_Finalize();
    return 0;
}
Next I ran this example, but this time I got the error below. I don't know what went wrong.
Fatal error in MPI_Allreduce: A process has failed, error stack:
MPI_Allreduce(861)........: MPI_Allreduce(sbuf=0x7ffff0f55630, rbuf=0x7ffff0f55634, count=1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD) failed
MPIR_Allreduce_impl(719)..:
MPIR_Allreduce_intra(362).:
dequeue_and_set_error(888): Communication error with rank 1
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= EXIT CODE: 1
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:1@ce-412] HYD_pmcd_pmip_control_cmd_cb (./pm/pmiserv/pmip_cb.c:886): assert (!closed) failed
[proxy:0:1@ce-412] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[proxy:0:1@ce-412] main (./pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@ce-411] HYDT_bscu_wait_for_completion (./tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@ce-411] HYDT_bsci_wait_for_completion (./tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@ce-411] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:217): launcher returned error waiting for completion
[mpiexec@ce-411] main (./ui/mpich/mpiexec.c:331): process manager error waiting for completion
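Yes, as @Alexey mentioned, this was purely a network problem. Here is what I did to get it working:
1. Exported the hosts file as HYDRA_HOST_FILE (see the MPICH documentation for details):
export HYDRA_HOST_FILE=<path_to_host_file>/hosts
2. Passed this option to mpiexec:
-disable-hostname-propagation
Finally, this is the command that gave me proper connectivity between the cluster nodes:
mpiexec -launcher fork -disable-hostname-propagation -f machinefile -np 4 ./Test
Judging by the "Communication error with rank 1" message, the master node (rank 0) cannot connect to the node running rank 1, so that is the direction to look in. You could try pinging node 1 from the root with a simple MPI_Send and MPI_Recv.
Those did not work either, so it is definitely a network configuration problem. Have you tried checking the following?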