Python: mpich2 and mpi4py problems with a scatter and gather example


I am trying to run a scatter and gather example using Python, but I am running into some problems. To make sure the cluster is working at all, I tried a Hello World first:

$ cat /var/nfs/helloworld.py 
#!/usr/bin/env python
"""
Parallel Hello World
"""

from mpi4py import MPI
import sys

size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()

sys.stdout.write( "Hello, World! I am process %d of %d on %s.\n" % (rank, size, name))
My machinefile looks like this:

$ cat /var/nfs/machinefile 
node1:8
node2:8
desktop01:8
So, looking at the lscpu -p output for the nodes:

$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,0,0,0,,0,0,0,0
5,1,0,0,,1,1,1,0
6,2,0,0,,2,2,2,0
7,3,0,0,,3,3,3,0
Everything runs as expected with the following command:

$ mpiexec.hydra  -np 24  --machinefile /var/nfs/machinefile python /var/nfs/helloworld.py 
Hello, World! I am process 22 of 24 on desktop01.
Hello, World! I am process 19 of 24 on desktop01.
Hello, World! I am process 20 of 24 on desktop01.
Hello, World! I am process 18 of 24 on desktop01.
Hello, World! I am process 23 of 24 on desktop01.
Hello, World! I am process 16 of 24 on desktop01.
Hello, World! I am process 21 of 24 on desktop01.
Hello, World! I am process 17 of 24 on desktop01.
Hello, World! I am process 5 of 24 on node1.
Hello, World! I am process 0 of 24 on node1.
Hello, World! I am process 3 of 24 on node1.
Hello, World! I am process 4 of 24 on node1.
Hello, World! I am process 15 of 24 on node2.
Hello, World! I am process 13 of 24 on node2.
Hello, World! I am process 11 of 24 on node2.
Hello, World! I am process 8 of 24 on node2.
Hello, World! I am process 6 of 24 on node1.
Hello, World! I am process 1 of 24 on node1.
Hello, World! I am process 10 of 24 on node2.
Hello, World! I am process 12 of 24 on node2.
Hello, World! I am process 14 of 24 on node2.
Hello, World! I am process 9 of 24 on node2.
Hello, World! I am process 7 of 24 on node1.
Hello, World! I am process 2 of 24 on node1.

$
Based on this, I assume my cluster is working.

Now I tried one of the demos that ships with mpi4py (2.0.0). I am using Python 3, and all the nodes are Linux machines running mpich2 (3.1.2).
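For context, the demo boils down to a buffer-based Scatter followed by an Allgather over numpy arrays. Below is a minimal sketch along those lines, reconstructed from the output further down, so the details (and the line numbers in the traceback) will not match the real /var/nfs/3.py exactly:

#!/usr/bin/env python3
# Minimal sketch of a Scatter/Allgather demo along the lines of the one being
# run; reconstructed from the output below, the real /var/nfs/3.py may differ.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

print("-" * 78)
print(" Running on %d cores" % comm.size)
print("-" * 78)

comm.Barrier()

my_N = 4                    # elements per rank
N = my_N * comm.size        # total size scales with the number of ranks

if comm.rank == 0:
    A = np.arange(N, dtype=np.float64)   # root owns the full array
else:
    A = np.empty(N, dtype=np.float64)    # others only need room for Allgather

my_A = np.empty(my_N, dtype=np.float64)

# Scatter: root hands my_N doubles to every rank
comm.Scatter([A, MPI.DOUBLE], [my_A, MPI.DOUBLE])
print("After Scatter:")
print("[%d] %s" % (comm.rank, my_A))

comm.Barrier()              # the traceback below points at a call like this

my_A *= 2                   # every rank doubles its chunk

# Allgather: every rank ends up with the full, doubled array
comm.Allgather([my_A, MPI.DOUBLE], [A, MPI.DOUBLE])
print("After Allgather:")
print("[%d] %s" % (comm.rank, A))

Note that the root's send buffer has my_N * comm.size elements, so the counts line up for any -np value.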

When I try to run it, it fails:

$ mpiexec.hydra  -np 24   --machinefile /var/nfs/machinefile /var/nfs/3.py 
------------------------------------------------------------------------------
 Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
[0] [ 0.  1.  2.  3.]
------------------------------------------------------------------------------
 Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
 Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
 Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
 Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
 Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
 Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
 Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
Traceback (most recent call last):
  File "/var/nfs/3.py", line 32, in <module>
    comm.Barrier()
  File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..: 
MPIR_Barrier(292).......: 
MPIR_Barrier_intra(149).: 
barrier_smp_intra(109)..: 
MPIR_Bcast_impl(1458)...: 
MPIR_Bcast(1482)........: 
MPIR_Bcast_intra(1291)..: 
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
  File "/var/nfs/3.py", line 32, in <module>
    comm.Barrier()
  File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..: 
MPIR_Barrier(292).......: 
MPIR_Barrier_intra(149).: 
barrier_smp_intra(109)..: 
MPIR_Bcast_impl(1458)...: 
MPIR_Bcast(1482)........: 
MPIR_Bcast_intra(1291)..: 
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
  File "/var/nfs/3.py", line 32, in <module>
    comm.Barrier()
  File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..: 
MPIR_Barrier(292).......: 
MPIR_Barrier_intra(149).: 
barrier_smp_intra(109)..: 
MPIR_Bcast_impl(1458)...: 
MPIR_Bcast(1482)........: 
MPIR_Bcast_intra(1291)..: 
MPIR_Bcast_binomial(309): Failure during collective
*** stack smashing detected ***: python3 terminated
Traceback (most recent call last):
  File "/var/nfs/3.py", line 32, in <module>
    comm.Barrier()
  File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..: 
MPIR_Barrier(292).......: 
MPIR_Barrier_intra(149).: 
barrier_smp_intra(109)..: 
MPIR_Bcast_impl(1458)...: 
MPIR_Bcast(1482)........: 
MPIR_Bcast_intra(1291)..: 
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
  File "/var/nfs/3.py", line 32, in <module>
    comm.Barrier()
  File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..: 
MPIR_Barrier(292).......: 
MPIR_Barrier_intra(149).: 
barrier_smp_intra(109)..: 
MPIR_Bcast_impl(1458)...: 
MPIR_Bcast(1482)........: 
MPIR_Bcast_intra(1291)..: 
MPIR_Bcast_binomial(309): Failure during collective

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 6022 RUNNING AT node1
=   EXIT CODE: 6
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@desktop01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@desktop01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@desktop01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@desktop01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@desktop01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@desktop01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@desktop01] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion

What am I doing wrong? Is there a mismatch in the amount of data I am supposed to pass to the scatter/gather calls? Is this a known issue with mpich2? I am fairly sure this used to work with openmpi + Python 2, but I cannot test that right now.
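To rule out a count mismatch, one sanity check would be the lowercase, pickle-based API, where the root scatters a plain Python list and the per-rank counts come from the list itself. This is only an illustrative sketch, not the failing demo:

# Illustrative sanity check, not the failing demo: the lowercase, pickle-based
# API scatters a plain Python list, so per-rank counts cannot mismatch.
from mpi4py import MPI

comm = MPI.COMM_WORLD

if comm.rank == 0:
    # one 4-element sub-list per rank: [0..3], [4..7], ...
    chunks = [list(range(4 * r, 4 * r + 4)) for r in range(comm.size)]
else:
    chunks = None

my_chunk = comm.scatter(chunks, root=0)    # each rank gets one sub-list
doubled = [2 * x for x in my_chunk]
gathered = comm.allgather(doubled)         # list of per-rank lists, everywhere

print("[%d] %s -> %s" % (comm.rank, my_chunk, gathered[comm.rank]))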

Comments:

Are you sure all three nodes are running (only) mpich? Are you sure one of them doesn't have openmpi installed by default?

I reinstalled the OS (Ubuntu) on all nodes and built mpich from source, so all nodes run the same version. Besides that, I invoke hydra on the master node, and the MPI proxies it starts on the other nodes control which version is run, so mixing versions should not be possible.

Check the network interfaces on all nodes. Make sure no firewall is blocking TCP/IP traffic between them, and make sure there is no interface that matches the network address but provides no actual connectivity (for example a virtual bridge). The Hello World example does not prove that MPI is working, because it involves no communication at all (only I/O redirection).

The firewall is disabled on all machines, but I will double-check. The three nodes are connected to the same switch, and there is no firewall between them, but I will look into it.
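Since the comments point out that Hello World exercises no communication at all, a minimal test that actually moves data between ranks could look like the following. Again, this is just an illustrative sketch; the payload size and names are arbitrary:

# Minimal ring test, illustrative sketch: every rank sends a buffer to the
# next rank and receives one from the previous rank, so the inter-node links
# actually get exercised, unlike in the Hello World example.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size
nxt, prv = (rank + 1) % size, (rank - 1) % size

sendbuf = np.full(1024, rank, dtype=np.float64)   # arbitrary payload
recvbuf = np.empty(1024, dtype=np.float64)

# Sendrecv avoids deadlock without having to order the sends and receives
comm.Sendrecv([sendbuf, MPI.DOUBLE], dest=nxt,
              recvbuf=[recvbuf, MPI.DOUBLE], source=prv)

print("[%d] received %.0f from rank %d" % (rank, recvbuf[0], prv))

If this hangs or fails only when ranks on different machines talk to each other, that would point at the network setup rather than at the scatter/gather code.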
For comparison, the same demo run with only 8 processes completes successfully:

$ mpiexec.hydra  -np 8   --machinefile /var/nfs/machinefile /var/nfs/3.py 
------------------------------------------------------------------------------
 Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[0] [ 0.  1.  2.  3.]
After Allgather:
[0] [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.  28.
  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.  56.  58.
  60.  62.]
------------------------------------------------------------------------------
 Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[1] [ 4.  5.  6.  7.]
After Allgather:
[1] [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.  28.
  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.  56.  58.
  60.  62.]
------------------------------------------------------------------------------
 Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[2] [  8.   9.  10.  11.]
After Allgather:
[2] [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.  28.
  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.  56.  58.
  60.  62.]
------------------------------------------------------------------------------
 Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[3] [ 12.  13.  14.  15.]
After Allgather:
[3] [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.  28.
  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.  56.  58.
  60.  62.]
------------------------------------------------------------------------------
 Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[4] [ 16.  17.  18.  19.]
After Allgather:
[4] [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.  28.
  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.  56.  58.
  60.  62.]
------------------------------------------------------------------------------
 Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[5] [ 20.  21.  22.  23.]
After Allgather:
[5] [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.  28.
  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.  56.  58.
  60.  62.]
------------------------------------------------------------------------------
 Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[6] [ 24.  25.  26.  27.]
After Allgather:
[6] [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.  28.
  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.  56.  58.
  60.  62.]
------------------------------------------------------------------------------
 Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[7] [ 28.  29.  30.  31.]
After Allgather:
[7] [  0.   2.   4.   6.   8.  10.  12.  14.  16.  18.  20.  22.  24.  26.  28.
  30.  32.  34.  36.  38.  40.  42.  44.  46.  48.  50.  52.  54.  56.  58.
  60.  62.]

$