C++: MPI size and number of OpenMP threads
I am trying to write a hybrid OpenMP/MPI program, so I want to understand how the number of OpenMP threads relates to the number of MPI processes. To that end, I wrote a small test program:
#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>

int main(int args, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&args, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    #pragma omp parallel private(thread_id, nthreads, cxx_procs)
    {
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        // Build the whole line first so that output from different threads does not interleave.
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id
                   << " out of " << nthreads
                   << " on MPI process nr. " << rank
                   << " out of " << nprocs
                   << ", while hardware_concurrency reports " << cxx_procs
                   << " processors\n";
        std::cout << omp_stream.str();
    }
    MPI_Finalize();
    return 0;
}
compiled with gcc-9.3.1 and openmpi3.
Now, when executing it with ./omp_mpi on an i7-6700 (4c/8t), I get the following output
I'm thread 1 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
i.e. as expected. When executing it with mpirun -n 1 omp_mpi I expected the same result, but instead I get
I'm thread 0 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
Where are the other threads? When executing it on two MPI processes instead, I get
I'm thread 0 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 0 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
i.e. still only two OpenMP threads, but when executing it on four MPI processes, I get
I'm thread 1 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
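Now I suddenly get eight OpenMP threads per MPI process. Where does this change come from?

The man page of mpirun explains:
If you are simply looking for how to run an MPI application, you probably want to use a command line of the following form: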
% mpirun [ -np X ] [ --hostfile <filename> ] <program>
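This will run X copies of <program> in your current run-time environment (…)
Please note that mpirun automatically binds processes as of the start of the v1.8 series. Three binding patterns are used in the absence of any further directives: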
Bind to core: when the number of processes is <= 2
Bind to socket: when the number of processes is > 2
Bind to none: when oversubscribed
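If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or specific number of processing elements per application process.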
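To make the three quoted rules concrete, here is a minimal sketch that encodes them as a function. The name default_binding and the slots parameter are my own illustration, not part of Open MPI; I am assuming "oversubscribed" means more processes than the node has slots.

```cpp
#include <string>

// Default binding mpirun (Open MPI >= 1.8) picks when no --bind-to option is
// given, following the three rules quoted from the man page.
// `slots` is a hypothetical parameter: how many processes the node can host
// before it counts as oversubscribed.
std::string default_binding(int nprocs, int slots) {
    if (nprocs > slots) return "none";   // oversubscribed
    if (nprocs <= 2)    return "core";
    return "socket";
}
```

For the 4-core machine in the question, default_binding(1, 4) and default_binding(2, 4) give "core", while default_binding(4, 4) gives "socket".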
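Now, if you specify 1 or 2 MPI processes, mpirun defaults to --bind-to core, which results in 2 threads per MPI process, because each core has two hardware threads.
However, if you specify 4 MPI processes, mpirun defaults to --bind-to socket and you get 8 threads per process, because your machine has a single socket. I tested it on a laptop (1s/2c/4t) and on a workstation (2 sockets, 12 cores per socket, 2 threads per core), and the program (with no np argument) behaves as described above: on the workstation there are 24 MPI processes with 24 OpenMP threads each.

You are observing an interaction between a peculiarity of Open MPI and the GNU OpenMP runtime libgomp.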
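First, the number of threads in OpenMP is controlled by the nthreads-var ICV (internal control variable), and the way to set it is to call omp_set_num_threads() or to set OMP_NUM_THREADS in the environment. When OMP_NUM_THREADS is not set and omp_set_num_threads() is not called, the runtime is free to choose whatever it deems a reasonable default. In the case of libgomp, the documentation says:
OMP_NUM_THREADS
Specifies the default number of threads to use in parallel regions. The value of this variable shall be a comma-separated list of positive integers; the value specifies the number of threads to use for the corresponding nested level. Specifying more than one item in the list will automatically enable nesting by default. If undefined one thread per CPU is used.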
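What it fails to mention is that it uses various heuristics to determine the right number of CPUs. On Linux and Windows, the process affinity mask is used for that purpose (if you like reading code, the Linux implementation is in the libgomp sources). If you bind the process to a single logical CPU, you get only one thread: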
$ taskset -c 0 ./omp_mpi
I'm thread 0 out of 1 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
If you bind it to several logical CPUs, their count is used:
$ taskset -c 0,2,5 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
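If you want to see the number libgomp starts from, you can count the CPUs in the affinity mask yourself. This is only a sketch of the idea, not libgomp's actual code: allowed_cpus is a name I made up, and it relies on the Linux/glibc call sched_getaffinity and the CPU_COUNT macro.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // needed for CPU_COUNT on glibc; g++ usually defines it already
#endif
#include <sched.h>    // sched_getaffinity, cpu_set_t, CPU_ZERO, CPU_COUNT

// Number of logical CPUs the calling process is allowed to run on,
// or -1 if the affinity mask cannot be read. The fixed-size cpu_set_t
// only covers machines with up to CPU_SETSIZE logical CPUs.
int allowed_cpus() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    return CPU_COUNT(&mask);
}
```

Printing allowed_cpus() from a program started under taskset -c 0,2,5 should give 3, which matches the thread count above, while hardware_concurrency kept reporting 12 in those runs.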
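This libgomp-specific behaviour interacts with another behaviour that is specific to Open MPI. Back in 2013, Open MPI changed its default binding policy. The reasons are a mix of technical ones and politics, and you can read more about them in a write-up by Jeff (a core Open MPI developer).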
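Combining the two behaviours, the thread counts from the question come down to simple arithmetic. The sketch below is my own illustration (the Node struct and default_omp_threads are hypothetical names), and it assumes OMP_NUM_THREADS is unset and that libgomp uses one thread per logical CPU in the affinity mask.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical description of a node; the field names are illustrative.
struct Node {
    int sockets;
    int cores_per_socket;
    int threads_per_core;
};

// Logical CPUs in the affinity mask of one MPI process for a given binding
// level. With OMP_NUM_THREADS unset, this is the team size libgomp picks.
int default_omp_threads(const Node& n, const std::string& binding) {
    if (binding == "hwthread") return 1;
    if (binding == "core")     return n.threads_per_core;
    if (binding == "socket")   return n.cores_per_socket * n.threads_per_core;
    if (binding == "none")     return n.sockets * n.cores_per_socket * n.threads_per_core;
    throw std::invalid_argument("unknown binding level: " + binding);
}
```

For the i7-6700, Node{1, 4, 2} gives 2 threads with "core" and 8 with "socket"; for the two-socket workstation, Node{2, 12, 2} gives 24 with "socket".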
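The moral of the story is:
Always set the number of OpenMP threads and the MPI binding policy explicitly. With Open MPI, the way to set environment variables is with -x: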
$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
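If you want the program itself to notice when nobody set the thread count, one option is to look at OMP_NUM_THREADS at startup. This is just a sketch: requested_threads is a hypothetical helper of mine, not something OpenMP or Open MPI provides, and it only looks at the first item of a comma-separated list.

```cpp
#include <climits>
#include <cstdlib>

// Top-level thread count requested through an OMP_NUM_THREADS-style value,
// or 0 if the value is missing or does not start with a positive integer.
// Only the first item of a comma-separated list is considered.
int requested_threads(const char* value) {
    if (value == nullptr) return 0;            // variable not set
    char* end = nullptr;
    long n = std::strtol(value, &end, 10);
    if (end == value || n <= 0 || n > INT_MAX) return 0;
    return static_cast<int>(n);
}
```

Calling requested_threads(std::getenv("OMP_NUM_THREADS")) before the first parallel region and printing a warning when it returns 0 makes it obvious that the run is relying on the runtime's default.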
Note that I have hyperthreading enabled, so --bind-to core and --bind-to hwthread produce different results when OMP_NUM_THREADS is not set explicitly:
mpiexec -n 2 --map-by node:PE=3 --bind-to core ./ompi_mpi
I'm thread 0 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
vs

mpiexec -n 2 --map-by node:PE=3 --bind-to hwthread ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors

--map-by node:PE=3 gives each MPI rank three processing elements (PEs) per node. When binding to core, a PE is a core. When binding to hardware threads, a PE is a hardware thread, and one should use --map-by node:PE=#cores*#threads, i.e. in my case --map-by node:PE=6.

Whether the OpenMP runtime respects the affinity mask set by MPI and maps its own thread affinity onto it, and what to do if it does not, is an entirely different story.

By MPI threads, do you mean MPI processes?
@dreamcrash: Yes, basically I know that, but it does not describe or explain the behaviour above...
I completely rewrote my answer. What do you think?
Thanks for the explanation. The bind-to-core behaviour explains it well.
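As a recap of the --map-by node:PE=N arithmetic explained in the answer, here is a small sketch. cpus_per_rank is a made-up name, and the function only covers the two binding units discussed there.

```cpp
#include <stdexcept>
#include <string>

// Logical CPUs available to one rank for --map-by node:PE=<pe> combined with
// --bind-to <unit>: a PE is a whole core when binding to cores and a single
// hardware thread when binding to hwthreads.
int cpus_per_rank(int pe, const std::string& unit, int threads_per_core) {
    if (unit == "core")     return pe * threads_per_core;
    if (unit == "hwthread") return pe;
    throw std::invalid_argument("unsupported binding unit: " + unit);
}
```

With two hardware threads per core, cpus_per_rank(3, "core", 2) is 6 and cpus_per_rank(3, "hwthread", 2) is 3, which matches the 6 and 3 OpenMP threads per rank in the two runs; cpus_per_rank(6, "hwthread", 2) gets back to 6.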