Testing Open MPI on multiple hosts - ring_c fails

I have set up Open MPI on two Ubuntu 14.04 hosts and am now testing the installation with the two supplied test programs hello_c and ring_c. The hosts are called "hermes" and "zeus", and both have the user "mpiuser", which can log in non-interactively (via ssh-agent).

The commands
mpirun hello_c
mpirun --host hermes,zeus hello_c
both work properly.

Calling the command locally,
mpirun --host zeus ring_c
also works. Output on hermes and zeus:

mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host zeus ring_c
Process 0 sending 10 to 0, tag 201 (1 processes in ring)
Process 0 sent to 0
Process 0 decremented value: 9
Process 0 decremented value: 8
Process 0 decremented value: 7
Process 0 decremented value: 6
Process 0 decremented value: 5
Process 0 decremented value: 4
Process 0 decremented value: 3
Process 0 decremented value: 2
Process 0 decremented value: 1
Process 0 decremented value: 0
Process 0 exiting

But calling the command
mpirun --host zeus,hermes ring_c
fails with the following output:

mpiuser@zeus:/opt/openmpi-1.6.5/examples$ mpirun --host hermes,zeus ring_c
Process 0 sending 10 to 1, tag 201 (2 processes in ring)
[zeus:2930] *** An error occurred in MPI_Recv
[zeus:2930] *** on communicator MPI_COMM_WORLD
[zeus:2930] *** MPI_ERR_TRUNCATE: message truncated
[zeus:2930] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
Process 0 sent to 1
--------------------------------------------------------------------------
mpirun has exited due to process rank 1 with PID 2930 on
node zeus exiting improperly. There are two reasons this could occur:

1. this process did not call "init" before exiting, but others in
the job did. This can cause a job to hang indefinitely while it waits
for all processes to call "init". By rule, if one process calls "init",
then ALL processes must call "init" prior to termination.

2. this process called "init", but exited without calling "finalize".
By rule, all processes that call "init" MUST call "finalize" prior to
exiting or it will be considered an "abnormal termination"

This may have caused other processes in the application to be
terminated by signals sent by mpirun (as reported here).
--------------------------------------------------------------------------

I haven't found any documentation on how to solve a problem like this, and I have no idea where to look for the mistake based on the error output. How can I resolve this?

You've changed two things between the first and second runs: you've increased the number of processes from 1 to 2, and you're running on multiple hosts rather than a single host.

I'd suggest you first check that you can run 2 processes on the same host:

mpirun -n 2 ring_c
and see what you get.

When debugging on a cluster, it's often useful to know where each process is running, and you should always print out the total number of processes as well. Try using the following code at the top of ring_c.c:

char nodename[MPI_MAX_PROCESSOR_NAME];
int namelen;

/* rank and size are already declared in ring_c.c */
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);

/* report which node each rank actually landed on */
MPI_Get_processor_name(nodename, &namelen);
printf("Rank %d out of %d running on node %s\n", rank, size, nodename);

The error you're getting says that an incoming message was too large for the receive buffer, which is odd, since the code always sends and receives a single integer.
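
For reference, here is a minimal, deliberately broken sketch (not part of the original examples; the file name trunc_demo.c is made up) that reproduces this class of error: the receiver posts a buffer smaller than the incoming message, which raises MPI_ERR_TRUNCATE under the default MPI_ERRORS_ARE_FATAL handler.

/* trunc_demo.c - hypothetical illustration of MPI_ERR_TRUNCATE.
 * Build: mpicc trunc_demo.c -o trunc_demo
 * Run:   mpirun -n 2 trunc_demo */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    int big[4] = {1, 2, 3, 4};
    int small;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* send four ints... */
        MPI_Send(big, 4, MPI_INT, 1, 201, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* ...but only leave room to receive one: message truncated */
        MPI_Recv(&small, 1, MPI_INT, 0, 201, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}

Since ring_c only ever transfers a single int, no such mismatch exists in the code itself; an error like this usually hints at something in the environment instead, such as different Open MPI versions or libraries being picked up on the two hosts.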

mpirun -n 2 ring_c
works on the same host. But I think I've found the error: the environment variables are wrong;
ssh user@IP env
does not show the correct $PATH and $LD_LIBRARY_PATH. So I tried
mpirun --prefix /opt/openmpi --host hermes,zeus ring_c
and it works fine. So now I have to figure out the correct way to export the variables.
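
For reference, a sketch of one way to export the variables for non-interactive ssh logins, assuming the install prefix /opt/openmpi used with --prefix above (adjust to the actual prefix). On Ubuntu the exports need to sit near the top of ~/.bashrc on every host, above the early "If not running interactively ... return" guard, or ssh-launched processes won't see them:

# put near the TOP of ~/.bashrc on each host (hermes and zeus)
export PATH=/opt/openmpi/bin:$PATH
export LD_LIBRARY_PATH=/opt/openmpi/lib:$LD_LIBRARY_PATH

You can then verify the non-interactive environment with something like:

ssh mpiuser@zeus env | grep PATH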