C++: MPI size and number of OpenMP threads


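I am trying to write a hybrid OpenMP/MPI program, so I want to understand how the number of OpenMP threads relates to the number of MPI processes. To that end, I wrote a small test program: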

#include <iostream>
#include <mpi.h>
#include <thread>
#include <sstream>
#include <omp.h>

int main(int args, char *argv[]) {
    int rank, nprocs, thread_id, nthreads, cxx_procs;
    MPI_Init(&args, &argv);

    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    #pragma omp parallel private(thread_id, nthreads, cxx_procs) 
    {
        thread_id = omp_get_thread_num();
        nthreads = omp_get_num_threads();
        cxx_procs = std::thread::hardware_concurrency();
        std::stringstream omp_stream;
        omp_stream << "I'm thread " << thread_id 
        << " out of " << nthreads 
        << " on MPI process nr. " << rank 
        << " out of " << nprocs 
        << ", while hardware_concurrency reports " << cxx_procs 
        << " processors\n";
        std::cout << omp_stream.str();
    }

    MPI_Finalize();
    return 0;
}
I am using gcc-9.3.1 and openmpi3. Now, when executing it with ./omp_mpi on an i7-6700 with 4c/8t, I get the following output:

I'm thread 1 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
i.e. as expected. When executing it with mpirun -n 1 omp_mpi I would expect the same result, but instead I get

I'm thread 0 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 1, while hardware_concurrency reports 8 processors
Where are the other threads? When executing it on two MPI processes, I get

I'm thread 0 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 1 out of 2, while hardware_concurrency reports 8 processors
I'm thread 0 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
I'm thread 1 out of 2 on MPI process nr. 0 out of 2, while hardware_concurrency reports 8 processors
i.e. still only two OpenMP threads, but when executing it on four MPI processes, I get

I'm thread 1 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 6 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 1 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 0 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 4 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 0 out of 4, while hardware_concurrency reports 8 processors
I'm thread 3 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 7 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 2 out of 4, while hardware_concurrency reports 8 processors
I'm thread 2 out of 8 on MPI process nr. 1 out of 4, while hardware_concurrency reports 8 processors
I'm thread 5 out of 8 on MPI process nr. 3 out of 4, while hardware_concurrency reports 8 processors

Now I suddenly get eight OpenMP threads per MPI process. Where does this change come from?

The man page of mpirun explains:

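If you are simply looking for how to run an MPI application, you probably want to use a command line of the following form:

  % mpirun [ -np X ] [ --hostfile <filename> ]  <program>

This will run X copies of <program> in your current run-time environment (...)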

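Please note that mpirun automatically binds processes as of the start of the v1.8 series. Three binding patterns are used in the absence of any further directives: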

  Bind to core:     when the number of processes is <= 2
  Bind to socket:   when the number of processes is > 2
  Bind to none:     when oversubscribed
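
If your application uses threads, then you probably want to ensure that you are either not bound at all (by specifying --bind-to none), or bound to multiple cores using an appropriate binding level or a specific number of processing elements per application process.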

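Now, if you specify 1 or 2 MPI processes, mpirun defaults to --bind-to core, which results in 2 threads per MPI process, because each core of your CPU exposes two hardware threads. If, however, you specify 4 MPI processes, mpirun defaults to --bind-to socket and you get 8 threads per process, since your machine has a single socket. I tested this on a laptop (1s/2c/4t) and on a workstation (2 sockets, 12 cores per socket, 2 threads per core), and the program (run without the np argument) behaves as described above: on the workstation there are 24 MPI processes with 24 OpenMP threads each.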

You are observing an interaction between a peculiarity of Open MPI and the GNU OpenMP runtime, libgomp.

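First, the number of threads in OpenMP is controlled by the num-threads ICV (internal control variable), and the way to set it is either to call omp_set_num_threads() or to set OMP_NUM_THREADS in the environment. When OMP_NUM_THREADS is not set and omp_set_num_threads() is not called, the runtime is free to choose whatever it deems a reasonable default. In the case of libgomp, the documentation says: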

OMP_NUM_THREADS

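Specifies the default number of threads to use in parallel regions. The value of this variable shall be a comma-separated list of positive integers; the value specifies the number of threads to use for the corresponding nested level. Specifying more than one item in the list will automatically enable nesting by default. If undefined, one thread per CPU is used.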

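What it fails to mention is that it uses various heuristics to determine the right number of CPUs. On Linux and Windows, the process affinity mask is used for that purpose (if you like reading code, see the Linux-specific part of the libgomp sources). If the process is bound to a single logical CPU, you get only one thread: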

$ taskset -c 0 ./omp_mpi
I'm thread 0 out of 1 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
If you bind it to several logical CPUs, their count is used:

$ taskset -c 0,2,5 ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 1, while hardware_concurrency reports 12 processors
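This libgomp-specific behaviour interacts with another behaviour that is specific to Open MPI. Back in 2013, Open MPI changed its default binding policy. The reasons are a mixture of technical and political ones, and you can read more about them in Jeff's write-up (Jeff is a core Open MPI developer).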

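The moral of the story is:

Always set the number of OpenMP threads and the MPI binding policy explicitly. With Open MPI, environment variables are passed to the ranks with -x: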

$ mpiexec -n 2 --map-by node:PE=3 --bind-to core -x OMP_NUM_THREADS=3 ./ompi_mpi   
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
Note that I have hyperthreading enabled, so --bind-to core and --bind-to hwthread produce different results when OMP_NUM_THREADS is not set explicitly:

mpiexec -n 2 --map-by node:PE=3 --bind-to core ./ompi_mpi 
I'm thread 0 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 5 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 6 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 3 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 4 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 6 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
vs

mpiexec -n 2 --map-by node:PE=3 --bind-to hwthread ./ompi_mpi
I'm thread 0 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 0 out of 2, while hardware_concurrency reports 12 processors
I'm thread 0 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 2 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors
I'm thread 1 out of 3 on MPI process nr. 1 out of 2, while hardware_concurrency reports 12 processors

--map-by node:PE=3 gives each MPI rank three processing elements (PEs) per node. When binding to cores, a PE is a core. When binding to hardware threads, a PE is a thread, and one should use --map-by node:PE=#cores*#threads, i.e., in my case, --map-by node:PE=6.

Whether the OpenMP runtime respects the affinity mask set by MPI, whether it maps its own thread affinity onto it, and what to do if it does not, is a completely different story.

By MPI threads, do you mean MPI processes?
@dreamcrash: Yes, basically I know that, but it does not describe or explain the behaviour above...
I have completely rewritten my answer. What do you think?
Thanks for the explanation. The bind-to-core behaviour explains it well.