Parallel HDF5: "make check" hangs when running t_mpi

Tags: parallel-processing, mpi, hdf5, lustre

I have been struggling for a whole week to get parallel HDF5 working on our cluster, with no progress at all. I hope someone can help me with this. Thanks.

I am building parallel HDF5 (hdf5-1.8.15-patch1) on a Lustre file system, under Red Hat Enterprise Linux 5.5 x86_64. I tried compiling it with both Intel MPI (impi) 4.0.2 and OpenMPI 1.8, and both builds succeed without any errors. When I run "make check", both pass the serial tests but hang as soon as they reach the parallel tests (specifically t_mpi), and in the end I have to kill the run with Ctrl+C. Here is the output:

lijm@c01b03:~/yuan/hdf5-1.8.15-patch1/testpar$ make check
  CC       t_mpi.o
t_mpi.c: In function ‘test_mpio_gb_file’:
t_mpi.c:284: warning: passing argument 1 of ‘malloc’ with different width due to prototype
t_mpi.c:284: warning: request for implicit conversion from ‘void *’ to ‘char *’ not permitted in C++
t_mpi.c: In function ‘test_mpio_1wMr’:
t_mpi.c:465: warning: passing argument 2 of ‘gethostname’ with different width due to prototype
t_mpi.c: In function ‘test_mpio_derived_dtype’:
t_mpi.c:682: warning: declaration of ‘nerrors’ shadows a global declaration
t_mpi.c:37: warning: shadowed declaration is here
t_mpi.c:771: warning: passing argument 5 of ‘MPI_File_set_view’ discards qualifiers from pointer target type
t_mpi.c:798: warning: passing argument 2 of ‘MPI_File_set_view’ with different width due to prototype
t_mpi.c:798: warning: passing argument 5 of ‘MPI_File_set_view’ discards qualifiers from pointer target type
t_mpi.c:685: warning: unused variable ‘etypenew’
t_mpi.c:682: warning: unused variable ‘nerrors’
t_mpi.c: In function ‘main’:
t_mpi.c:1104: warning: too many arguments for format
t_mpi.c: In function ‘test_mpio_special_collective’:
t_mpi.c:991: warning: will never be executed
t_mpi.c:992: warning: will never be executed
t_mpi.c:995: warning: will never be executed
t_mpi.c: In function ‘test_mpio_gb_file’:
t_mpi.c:229: warning: will never be executed
t_mpi.c:232: warning: will never be executed
t_mpi.c:237: warning: will never be executed
t_mpi.c:238: warning: will never be executed
t_mpi.c:253: warning: will never be executed
t_mpi.c:258: warning: will never be executed
t_mpi.c:259: warning: will never be executed
t_mpi.c:281: warning: will never be executed
t_mpi.c:246: warning: will never be executed
t_mpi.c:267: warning: will never be executed
t_mpi.c:319: warning: will never be executed
t_mpi.c:343: warning: will never be executed
t_mpi.c:385: warning: will never be executed
t_mpi.c:389: warning: will never be executed
t_mpi.c:248: warning: will never be executed
t_mpi.c:269: warning: will never be executed
t_mpi.c: In function ‘main’:
t_mpi.c:1143: warning: will never be executed
t_mpi.c:88: warning: will never be executed
t_mpi.c:102: warning: will never be executed
t_mpi.c:133: warning: will never be executed
t_mpi.c:142: warning: will never be executed
  CCLD     t_mpi
make  t_mpi testphdf5 t_cache t_pflush1 t_pflush2 t_pshutdown t_prestart t_shapesame
make[1]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[1]: `t_mpi' is up to date.
make[1]: `testphdf5' is up to date.
make[1]: `t_cache' is up to date.
make[1]: `t_pflush1' is up to date.
make[1]: `t_pflush2' is up to date.
make[1]: `t_pshutdown' is up to date.
make[1]: `t_prestart' is up to date.
make[1]: `t_shapesame' is up to date.
make[1]: Leaving directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make  check-TESTS
make[1]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[2]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[3]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[3]: Nothing to be done for `_exec_check-s'.
make[3]: Leaving directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[2]: Leaving directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
make[2]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
===Parallel tests in testpar begin Thu Jun 11 22:07:48 CST 2015===
**** Hint ****
Parallel test files reside in the current directory by default.
Set HDF5_PARAPREFIX to use another directory. E.g.,
HDF5_PARAPREFIX=/PFS/user/me
export HDF5_PARAPREFIX
make check
**** end of Hint ****
make[3]: Entering directory `/home/lijm/yuan/hdf5-1.8.15-patch1/testpar'
============================
Testing  t_mpi
============================
 t_mpi  Test Log
============================
===================================
MPI functionality tests
===================================
Proc 1: hostname=c01b03
Proc 2: hostname=c01b03
Proc 3: hostname=c01b03
Proc 5: hostname=c01b03
--------------------------------
Proc 0: *** MPIO 1 write Many read test...
--------------------------------
Proc 0: hostname=c01b03
Proc 4: hostname=c01b03
Command exited with non-zero status 255
0.08user 0.01system 0:37.65elapsed 0%CPU (0avgtext+0avgdata    0maxresident)k
0inputs+0outputs (0major+5987minor)pagefaults 0swaps
make[3]: *** [t_mpi.chkexe_] Error 1
make[2]: *** [build-check-p] Interrupt
make[1]: *** [test] Interrupt
make: *** [check-am] Interrupt
The output above is the same for both MPI implementations, but OpenMPI additionally prints this warning:

WARNING: It appears that your OpenFabrics subsystem is configured to only allow registering part of your physical memory. This can cause MPI jobs to run with erratic performance, hang, and/or crash.

I have looked into this warning, but I do not think it can be the cause of the hang, for the reason stated at the end.

I tried to find out where it hangs, and it turns out it always gets stuck at the first collective call it reaches. For example, in t_mpi it first hangs at:

MPI_File_delete(filename, MPI_INFO_NULL);    (line 477)

If I comment that line out, it then gets stuck at the MPI_File_open below it. I am not sure what happens inside these functions, though.
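To take "make check" out of the loop, the failing test can also be run by hand; a minimal sketch, assuming a hypothetical Lustre scratch directory (the six ranks match the Proc 0..5 lines in the log above):

cd ~/yuan/hdf5-1.8.15-patch1/testpar
export HDF5_PARAPREFIX=/path/to/lustre/scratch    # hypothetical Lustre directory
mpirun -np 6 ./t_mpi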

One more thing I noticed: the folder where I ran "make" for HDF5 sits on an NFS file system, and I can only reach Lustre through a specific folder located elsewhere. I found that if I do not set HDF5_PARAPREFIX to my Lustre folder, the tests run perfectly well, because by default they execute in the current (local) directory. So I suspect this is a problem with Lustre itself rather than the memory-registration limit; the A/B test sketched below summarizes the observation.
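In other words (paths hypothetical):

unset HDF5_PARAPREFIX                             # tests run in the local NFS build dir: they pass
make check
export HDF5_PARAPREFIX=/path/to/lustre/scratch    # tests run on Lustre: t_mpi hangs
make check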


Thanks, everyone!

It is hard to say what is going on here.

It could be that you are getting the "generic Unix file system" driver applied to your Lustre volume. Intel MPI needs two environment variables (I_MPI_EXTRA_FILESYSTEM and I_MPI_EXTRA_FILESYSTEM_LIST) set before it takes the Lustre-optimized code path (see the Intel MPI documentation for details):
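A minimal sketch of those settings (the variable names come from the paragraph above; the on/lustre values are the ones Intel MPI documents):

export I_MPI_EXTRA_FILESYSTEM=on
export I_MPI_EXTRA_FILESYSTEM_LIST=lustre
mpirun -np 6 ./t_mpi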

With OpenMPI you also have to request Lustre support explicitly when building the library, along the lines of the sketch below.
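One way is to hand ROMIO a file-system list that includes Lustre at configure time (a sketch; the install prefix is hypothetical, and the flag set is the commonly documented one rather than anything specific to this cluster):

./configure --prefix=$HOME/openmpi-1.8-lustre \
            --with-io-romio-flags="--with-file-system=ufs+nfs+lustre"
make -j8 && make install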


It would also help a great deal if you could attach a debugger to one or more of the stuck processes to see exactly where they hang. Are they stuck in an I/O routine? Stuck in communication?
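For example, on the node where the ranks are hung, something like this (assuming gdb and pgrep are available there):

pid=$(pgrep -u "$USER" t_mpi | head -n 1)         # pick one hung rank
gdb -p "$pid" -batch -ex 'thread apply all bt'    # dump a backtrace of every thread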

Hi Rob, thanks for your reply! Our cluster admin told me that the folder I was using to run the tests was "unstable", and after I switched to another folder on the cluster the problem went away. I have no idea what was wrong with that folder...