Python: MPICH2 and mpi4py problems, with a scatter/gather example
I am trying to run a scatter and gather example with Python, but I am running into some problems. To make sure the cluster itself works, I first tried a hello-world:
$ cat /var/nfs/helloworld.py
#!/usr/bin/env python
"""
Parallel Hello World
"""
from mpi4py import MPI
import sys
size = MPI.COMM_WORLD.Get_size()
rank = MPI.COMM_WORLD.Get_rank()
name = MPI.Get_processor_name()
sys.stdout.write( "Hello, World! I am process %d of %d on %s.\n" % (rank, size, name))
My machine file looks like this:
$ cat /var/nfs/machinefile
node1:8
node2:8
desktop01:8
For reference, here is the lscpu -p output on the nodes:
$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,0,0,0,,0,0,0,0
5,1,0,0,,1,1,1,0
6,2,0,0,,2,2,2,0
7,3,0,0,,3,3,3,0
and it runs as expected with the following command:
$ mpiexec.hydra -np 24 --machinefile /var/nfs/machinefile python /var/nfs/helloworld.py
Hello, World! I am process 22 of 24 on desktop01.
Hello, World! I am process 19 of 24 on desktop01.
Hello, World! I am process 20 of 24 on desktop01.
Hello, World! I am process 18 of 24 on desktop01.
Hello, World! I am process 23 of 24 on desktop01.
Hello, World! I am process 16 of 24 on desktop01.
Hello, World! I am process 21 of 24 on desktop01.
Hello, World! I am process 17 of 24 on desktop01.
Hello, World! I am process 5 of 24 on node1.
Hello, World! I am process 0 of 24 on node1.
Hello, World! I am process 3 of 24 on node1.
Hello, World! I am process 4 of 24 on node1.
Hello, World! I am process 15 of 24 on node2.
Hello, World! I am process 13 of 24 on node2.
Hello, World! I am process 11 of 24 on node2.
Hello, World! I am process 8 of 24 on node2.
Hello, World! I am process 6 of 24 on node1.
Hello, World! I am process 1 of 24 on node1.
Hello, World! I am process 10 of 24 on node2.
Hello, World! I am process 12 of 24 on node2.
Hello, World! I am process 14 of 24 on node2.
Hello, World! I am process 9 of 24 on node2.
Hello, World! I am process 7 of 24 on node1.
Hello, World! I am process 2 of 24 on node1.
$
Based on this, I assume my cluster is working.
Now I tried a demo that ships with mpi4py (2.0.0). I am using Python 3, and all nodes run Linux with MPICH2 (3.1.2).
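The demo script /var/nfs/3.py is not listed in the question. Judging from the output below (4-element chunks after Scatter, doubled values after Allgather, and a traceback pointing at a Barrier around line 32), it presumably looks roughly like this. This is a hedged reconstruction using mpi4py's buffer-based collectives, not the actual demo file; the `chunk_len` helper and the `N = 4` chunk size are my own assumptions:

```python
import numpy as np

N = 4  # elements handled per rank (assumed from the 4-element chunks in the output)

def chunk_len(total, nprocs):
    """Elements per rank for an even Scatter; total must divide evenly."""
    assert total % nprocs == 0, "sendbuf size must be divisible by nprocs"
    return total // nprocs

def main():
    # mpi4py is imported here so chunk_len stays usable without MPI installed
    from mpi4py import MPI
    comm = MPI.COMM_WORLD
    size = comm.Get_size()
    rank = comm.Get_rank()

    print("-" * 78)
    print("Running on %d cores" % size)
    print("-" * 78)

    sendbuf = None
    if rank == 0:
        # one contiguous buffer on the root, N elements per rank
        sendbuf = np.arange(size * N, dtype=np.float64)
    recvbuf = np.empty(N, dtype=np.float64)
    comm.Scatter(sendbuf, recvbuf, root=0)
    print("After Scatter:")
    print("[%d] %s" % (rank, recvbuf))

    comm.Barrier()  # the traceback blames a Barrier at line 32 of 3.py

    # every rank doubles its chunk, then all ranks collect the full array
    gathered = np.empty(size * N, dtype=np.float64)
    comm.Allgather(recvbuf * 2, gathered)
    print("After Allgather:")
    print("[%d] %s" % (rank, gathered))

if __name__ == "__main__":
    try:
        main()
    except ImportError:
        pass  # mpi4py not installed; the pure helper above still works
```

A script along these lines would be launched the same way as the hello-world, e.g. `mpiexec.hydra -np 8 --machinefile /var/nfs/machinefile python /var/nfs/3.py`.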
When I try to run it, it fails:
$ mpiexec.hydra -np 24 --machinefile /var/nfs/machinefile /var/nfs/3.py
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
[0] [ 0. 1. 2. 3.]
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
------------------------------------------------------------------------------
Running on 24 cores
------------------------------------------------------------------------------
After Scatter:
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
*** stack smashing detected ***: python3 terminated
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
Traceback (most recent call last):
File "/var/nfs/3.py", line 32, in <module>
comm.Barrier()
File "MPI/Comm.pyx", line 568, in mpi4py.MPI.Comm.Barrier (src/mpi4py.MPI.c:97474)
mpi4py.MPI.Exception: Other MPI error, error stack:
PMPI_Barrier(428).......: MPI_Barrier(MPI_COMM_WORLD) failed
MPIR_Barrier_impl(335)..: Failure during collective
MPIR_Barrier_impl(328)..:
MPIR_Barrier(292).......:
MPIR_Barrier_intra(149).:
barrier_smp_intra(109)..:
MPIR_Bcast_impl(1458)...:
MPIR_Bcast(1482)........:
MPIR_Bcast_intra(1291)..:
MPIR_Bcast_binomial(309): Failure during collective
===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 6022 RUNNING AT node1
= EXIT CODE: 6
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
[proxy:0:2@desktop01] HYD_pmcd_pmip_control_cmd_cb (pm/pmiserv/pmip_cb.c:885): assert (!closed) failed
[proxy:0:2@desktop01] HYDT_dmxu_poll_wait_for_event (tools/demux/demux_poll.c:76): callback returned error status
[proxy:0:2@desktop01] main (pm/pmiserv/pmip.c:206): demux engine error waiting for event
[mpiexec@desktop01] HYDT_bscu_wait_for_completion (tools/bootstrap/utils/bscu_wait.c:76): one of the processes terminated badly; aborting
[mpiexec@desktop01] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[mpiexec@desktop01] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[mpiexec@desktop01] main (ui/mpich/mpiexec.c:344): process manager error waiting for completion
What am I doing wrong? Should the amount of data I pass to the scatter/gather methods match something, and is there a mismatch here? Is this a known issue in MPICH2? I am fairly sure this used to work with OpenMPI + Python 2, but I cannot test that right now.

Comment: Are you sure all three nodes are running (only) MPICH? Are you sure none of them has OpenMPI installed by default?

Reply: I re-flashed the OS (Ubuntu) on all nodes and installed MPICH from source, so all nodes run the same version. Besides that, I invoke Hydra on the master node, and the MPI proxies it starts on the other nodes control which version runs, so mixing versions should not be possible.

Comment: Check the network interfaces on all nodes. Make sure no firewall is blocking TCP/IP traffic between them, and that there are no interfaces (e.g. virtual bridges) that match the network address but provide no actual connectivity. The hello-world example does not prove MPI is working, because it involves no communication (only I/O redirection).

Reply: The firewall is disabled on all machines, but I will check again. The three nodes are connected to the same switch, with no firewall between them. I will look into it, though.
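One cheap thing to rule out, along the lines of the "data mismatch" suspicion above, is the size contract of MPI_Scatter: the root's send buffer must hold exactly nprocs × recv_count elements. A hypothetical helper (`scatter_check`, not part of the demo) sketches the arithmetic; note this would only explain the failure if the demo hardcodes its buffer size for a fixed process count, and the comments' networking theory remains the other candidate:

```python
def scatter_check(send_elems, nprocs, recv_elems):
    """Return True if a root send buffer of send_elems elements can be
    scattered evenly to nprocs ranks receiving recv_elems elements each."""
    return send_elems == nprocs * recv_elems

# A buffer sized for 8 ranks x 4 elements satisfies the contract at -np 8 ...
print(scatter_check(32, 8, 4))   # True
# ... but not at -np 24, where MPI_Scatter would read past the buffer.
print(scatter_check(32, 24, 4))  # False
```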
However, running the same example with only 8 processes works:
$ mpiexec.hydra -np 8 --machinefile /var/nfs/machinefile /var/nfs/3.py
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[0] [ 0. 1. 2. 3.]
After Allgather:
[0] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[1] [ 4. 5. 6. 7.]
After Allgather:
[1] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[2] [ 8. 9. 10. 11.]
After Allgather:
[2] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[3] [ 12. 13. 14. 15.]
After Allgather:
[3] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[4] [ 16. 17. 18. 19.]
After Allgather:
[4] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[5] [ 20. 21. 22. 23.]
After Allgather:
[5] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[6] [ 24. 25. 26. 27.]
After Allgather:
[6] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
------------------------------------------------------------------------------
Running on 8 cores
------------------------------------------------------------------------------
After Scatter:
[7] [ 28. 29. 30. 31.]
After Allgather:
[7] [ 0. 2. 4. 6. 8. 10. 12. 14. 16. 18. 20. 22. 24. 26. 28.
30. 32. 34. 36. 38. 40. 42. 44. 46. 48. 50. 52. 54. 56. 58.
60. 62.]
$
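The -np 8 output is consistent with each rank doubling its 4-element chunk before the Allgather. Assuming that is what the demo does, the arithmetic can be checked in a single process with plain NumPy (no MPI involved):

```python
import numpy as np

nprocs, N = 8, 4
sendbuf = np.arange(nprocs * N, dtype=np.float64)
# model Scatter: rank r receives elements [r*N, (r+1)*N)
chunks = np.split(sendbuf, nprocs)
# model the per-rank work (doubling) followed by Allgather
gathered = np.concatenate([c * 2 for c in chunks])
print(gathered)  # 0. 2. 4. ... 62., matching the "After Allgather" lines
```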