C++: number of threads in OMP sections

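My computer has four cores. I am running Ubuntu 15.10 and compiling with g++ -fopenmp.

I have two different kinds of work, independent of each other: Work1 and Work2. In particular, Work1 should run on a single processor, while Work2 should be parallelized. I tried using omp_set_num_threads():

Suppose Work2 looks like this: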

void Work2(...){
    #pragma omp parallel for
    for (...) ...

    return;
}

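When the program runs, only two processors are used. Apparently omp_set_num_threads() is not working as I expected. Is there anything that can be done with OpenMP to remedy this situation?

Thanks to all,

Rodrigo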

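First, the OpenMP standard gives no guarantee that the two sections will be executed by different threads (Section 2.7.2 "sections Construct"):

The method of scheduling the structured blocks among the threads in the team is implementation defined.

The only reliable way to have the two work routines execute concurrently is to use explicit flow control based on the thread ID: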

#pragma omp parallel num_threads(2)
{
   if (omp_get_thread_num() == 0)
   {
      omp_set_num_threads(1);
      Work1();
   }
   else
   {
      omp_set_num_threads(3);
      Work2();
   }
}

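Further, whether the nested parallel region in Work2() will use more than one thread depends on a combination of several factors. Among them are the values of several internal control variables (ICVs):

nest-var controls whether nested parallelism is enabled; it is initialised from the value of OMP_NESTED and set at run time by calling omp_set_nested()
thread-limit-var (since OpenMP 3.0) sets the upper limit on the total number of OpenMP threads in all active parallel regions; it is initialised from the value of OMP_THREAD_LIMIT and set at run time by applying the thread_limit clause
max-active-levels-var (since OpenMP 3.0) limits the depth of active parallel regions; it is initialised from the value of OMP_MAX_ACTIVE_LEVELS and set by calling omp_set_max_active_levels()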
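To see what these ICVs are set to in a given environment, they can be read back at run time. The following is a minimal sketch, not part of the original answer: IcvSnapshot, read_icvs and print_icvs are hypothetical helper names, and the serial fallbacks are an assumption added so that the file also builds without -fopenmp. nest-var has its own getter, omp_get_nested(), which is left out here because newer OpenMP versions deprecate it.

```cpp
#include <cassert>
#include <cstdio>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption): without -fopenmp the program behaves
// like a runtime that only ever has one thread.
static int omp_get_max_active_levels() { return 1; }
static int omp_get_thread_limit() { return 1; }
static int omp_get_max_threads() { return 1; }
#endif

// Hypothetical helper type holding the values of interest.
struct IcvSnapshot
{
    int max_active_levels;  // max-active-levels-var
    int thread_limit;       // thread-limit-var
    int max_threads;        // team size the next parallel region would request
};

// Reads the current ICV values through the standard getter routines.
IcvSnapshot read_icvs()
{
    IcvSnapshot s;
    s.max_active_levels = omp_get_max_active_levels();
    s.thread_limit = omp_get_thread_limit();
    s.max_threads = omp_get_max_threads();
    // A runtime always has at least the initial thread available.
    assert(s.max_threads >= 1);
    return s;
}

// Prints the snapshot, one ICV per line.
void print_icvs(const IcvSnapshot& s)
{
    std::printf("max-active-levels-var: %d\n", s.max_active_levels);
    std::printf("thread-limit-var:      %d\n", s.thread_limit);
    std::printf("max threads:           %d\n", s.max_threads);
}
```

Calling print_icvs(read_icvs()) at the start of main() and running the program with different values of OMP_THREAD_LIMIT or OMP_MAX_ACTIVE_LEVELS shows how the environment variables feed into the ICVs.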

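If nest-var is false, the values of the other ICVs do not matter: nested parallelism is disabled. This is the default mandated by the standard, so nested parallelism must be enabled explicitly.

If nested parallelism is enabled, it only works at levels up to max-active-levels, with the outermost parallel region at level 1, the first nested parallel region at level 2, and so on. The default value of this ICV is the number of levels of nested parallelism supported by the implementation. Parallel regions at deeper levels are deactivated, i.e. they execute serially with their master threads only.

If nested parallelism is enabled and a particular parallel region is nested at a level no deeper than max-active-levels, then whether it executes in parallel is determined by the value of thread-limit-var. In your case, any value less than 4 will prevent Work2() from executing with three threads.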
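The same conditions can also be set up from code rather than from the environment. The sketch below is an illustration under assumptions, not the answer's own method: nested_team_size is a hypothetical helper, the serial fallbacks exist only so the file builds without -fopenmp, and both omp_set_nested() and omp_set_max_active_levels() are called because older runtimes need the former while newer ones deprecate it in favour of the latter. The function reports how many threads the nested region really received, which may be fewer than requested if thread-limit-var or the runtime says so.

```cpp
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption) so the sketch builds without -fopenmp.
static void omp_set_nested(int) {}
static void omp_set_max_active_levels(int) {}
static int omp_get_thread_num() { return 0; }
static int omp_get_num_threads() { return 1; }
#endif

// Opens a two-thread outer region and, from its last thread, a nested
// region that asks for `requested` threads. Returns the team size the
// nested region actually got.
int nested_team_size(int requested)
{
    assert(requested > 0);

    omp_set_nested(1);             // nest-var (deprecated in newer OpenMP versions)
    omp_set_max_active_levels(2);  // allow two active levels

    int inner_size = 0;
    #pragma omp parallel num_threads(2) shared(inner_size)
    {
        // Thread 1 opens the nested region; if the runtime only granted
        // one thread, thread 0 is also the last thread and does it.
        if (omp_get_thread_num() == omp_get_num_threads() - 1)
        {
            #pragma omp parallel num_threads(requested)
            {
                // One thread of the inner team records its size.
                #pragma omp single
                inner_size = omp_get_num_threads();
            }
        }
    }
    return inner_size;
}
```

With nesting available and no thread limit in the way, nested_team_size(3) would be expected to return 3; running the same binary with OMP_THREAD_LIMIT=3 should make the effect of the limit visible in the return value.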

The following test program can be used to examine the interplay between those ICVs:

#include <stdio.h>
#include <omp.h>

void Work1(void)
{
   printf("Work1: started by tid %d/%d\n",
      omp_get_thread_num(), omp_get_num_threads());
}

void Work2(void)
{
   printf("Work2: started by tid %d/%d\n",
      omp_get_thread_num(), omp_get_num_threads());

   #pragma omp parallel for schedule(static)
   for (int i = 0; i < 3; i++)
   {
      printf("Work2 nested loop: %d by tid %d/%d\n", i,
         omp_get_thread_num(), omp_get_num_threads());
   }
}

int main(void)
{
   #pragma omp parallel num_threads(2)
   {
      if (omp_get_thread_num() == 0)
      {
         omp_set_num_threads(1);
         Work1();
      }
      else
      {
         omp_set_num_threads(3);
         Work2();
      }
   }
   return 0;
}

The outermost parallel region is active. The nested one in Work2() is inactive because nested parallelism is disabled by default.

$ OMP_NESTED=TRUE ./nested
Work1: started by tid 0/2
Work2: started by tid 1/2
Work2 nested loop: 0 by tid 0/3
Work2 nested loop: 1 by tid 1/3
Work2 nested loop: 2 by tid 2/3

All parallel regions are active and execute in parallel.

$ OMP_NESTED=TRUE OMP_MAX_ACTIVE_LEVELS=1 ./nested
Work1: started by tid 0/2
Work2: started by tid 1/2
Work2 nested loop: 0 by tid 0/1
Work2 nested loop: 1 by tid 0/1
Work2 nested loop: 2 by tid 0/1

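Despite nested parallelism being enabled, only one level of parallelism can be active, so the nested region executes serially. With pre-OpenMP 3.0 compilers, e.g. GCC 4.4, setting OMP_MAX_ACTIVE_LEVELS has no effect.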

$ OMP_NESTED=TRUE OMP_THREAD_LIMIT=3 ./nested
Work1: started by tid 0/2
Work2: started by tid 1/2
Work2 nested loop: 0 by tid 0/2
Work2 nested loop: 2 by tid 1/2
Work2 nested loop: 1 by tid 0/2

The nested region is active, but it executes with only two threads because of the global thread limit imposed by setting OMP_THREAD_LIMIT.

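If you have enabled nested parallelism, there is no limit on the number of active levels, and the thread limit is high enough, there should be no reason for your program not to use four CPU cores at the same time...

... unless process and/or thread binding is in effect. Binding controls the affinity of the different OpenMP threads to the available CPUs. With most OpenMP runtimes thread binding is disabled by default, and the OS scheduler is free to move the threads between the available cores as it sees fit. Nevertheless, the runtimes usually respect the affinity mask applied to the process as a whole. If you use something like taskset to pin/bind the process to two logical CPUs, then no matter how many threads are spawned, they will all run on those two logical CPUs and timeshare. With GCC, thread binding is controlled by setting GOMP_CPU_AFFINITY and/or OMP_PROC_BIND, and with recent versions that support OpenMP 4.0 also by setting OMP_PLACES.

If you are not binding the executable (verify by checking the value of Cpus_allowed in /proc/$PID/status, where $PID is the PID of the running OpenMP process), none of GOMP_CPU_AFFINITY, OMP_PROC_BIND and OMP_PLACES is set, nested parallelism is enabled, there are no limits on the active parallelism levels or on the number of threads, and programs like top or htop still show only two logical CPUs in use, then there is something wrong with your program's logic and not with the OpenMP environment.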
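The affinity check can also be done from inside the program. The sketch below is Linux-specific and only an illustration: status_field and own_cpus_allowed_list are hypothetical helper names, and the code assumes the usual "Key:<tab>value" layout of /proc/self/status, where Cpus_allowed_list gives the permitted logical CPUs as a range list such as 0-3.

```cpp
#include <cassert>
#include <fstream>
#include <istream>
#include <sstream>
#include <string>

// Returns the value of the first "key: value" line in the stream,
// with leading blanks and tabs stripped, or an empty string if the
// key is not present.
std::string status_field(std::istream& in, const std::string& key)
{
    const std::string prefix = key + ":";
    std::string line;
    while (std::getline(in, line))
    {
        // Skip lines that do not start with "key:".
        if (line.compare(0, prefix.size(), prefix) != 0)
            continue;
        const std::size_t first = line.find_first_not_of(" \t", prefix.size());
        return first == std::string::npos ? std::string() : line.substr(first);
    }
    return std::string();
}

// Reads the CPU list the current process is allowed to run on.
// Returns an empty string if /proc/self/status cannot be read.
std::string own_cpus_allowed_list()
{
    std::ifstream status("/proc/self/status");
    return status_field(status, "Cpus_allowed_list");
}
```

Printing own_cpus_allowed_list() at program start, once normally and once under taskset, is a quick way to confirm whether a restrictive process-wide mask explains why only two CPUs are busy.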

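In GCC, sections are implemented (or at least were at some point) as a combination of a parallel for and if-then or switch-case statements. Why not do something similar yourself?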

#pragma omp parallel
{
    unsigned ithread = omp_get_thread_num();
    unsigned nthread = omp_get_num_threads();
    if(ithread==0)  work1();
    if(ithread!=0 || nthread==1) {
        //distribute work2 to nthread-1 threads.
        unsigned start = nthread==1 ? 0 : (ithread-1)*N/(nthread-1);
        unsigned end   = nthread==1 ? N :     ithread*N/(nthread-1);
        for(unsigned i=start; i<end; i++) {
            //work2 per iteration
        }
    }
}
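The index arithmetic in that manual split is easy to get wrong, so it may help to isolate it in a function and check it separately. The following is a sketch under assumptions: work2_range and run_split are hypothetical names, work1 and the per-iteration work2 body are replaced by stand-ins, and the serial fallbacks are there only so that the file builds without -fopenmp.

```cpp
#include <cassert>
#include <utility>
#ifdef _OPENMP
#include <omp.h>
#else
// Serial fallbacks (assumption) so the sketch builds without -fopenmp.
static int omp_get_thread_num() { return 0; }
static int omp_get_num_threads() { return 1; }
#endif

// Half-open range [first, second) of work2 iterations for one thread.
// Thread 0 is reserved for work1 unless it is the only thread.
std::pair<unsigned, unsigned> work2_range(unsigned ithread, unsigned nthread, unsigned n)
{
    assert(nthread >= 1 && ithread < nthread);
    if (nthread == 1)
        return std::make_pair(0u, n);   // single thread does everything
    if (ithread == 0)
        return std::make_pair(0u, 0u);  // busy with work1, no work2 share
    // 64-bit intermediate avoids overflow of (ithread * n).
    const unsigned long long workers = nthread - 1;
    const unsigned start = static_cast<unsigned>((ithread - 1ull) * n / workers);
    const unsigned end = static_cast<unsigned>(1ull * ithread * n / workers);
    return std::make_pair(start, end);
}

// Runs the split once and returns how many work2 iterations were executed.
unsigned long run_split(unsigned n)
{
    unsigned long done = 0;
    bool work1_done = false;
    #pragma omp parallel reduction(+ : done) shared(work1_done)
    {
        const unsigned ithread = static_cast<unsigned>(omp_get_thread_num());
        const unsigned nthread = static_cast<unsigned>(omp_get_num_threads());
        if (ithread == 0)
            work1_done = true;  // stand-in for work1()
        const std::pair<unsigned, unsigned> r = work2_range(ithread, nthread, n);
        for (unsigned i = r.first; i < r.second; i++)
            done += 1;          // stand-in for one work2 iteration
    }
    assert(work1_done);
    return done;
}
```

Whatever team size the runtime picks, run_split(n) should return n, since the ranges of threads 1 to nthread-1 tile the interval from 0 to n without gaps or overlap.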

Another option is to let a single construct hand work1 to whichever thread reaches it first and to schedule work2 dynamically:

#pragma omp parallel
{ 
    //while(1) {
    #pragma omp single nowait
    work1();
    #pragma omp for schedule(dynamic) nowait
    for(int i=0; i<N; i++) {
        //work2 per iteration
    }
    //}
}

One drawback of this approach is that dynamic scheduling has a higher overhead than static scheduling. However, it no longer requires work1 to run on a particular thread, and if the thread executing work1 finishes before the threads executing work2, it can help with work2. This approach can therefore balance the load better.

You can use the num_threads clause instead of omp_set_num_threads.
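As an illustration of the num_threads clause, the team size can be requested directly on the loop inside Work2 rather than by calling omp_set_num_threads in the caller. This is a minimal sketch: work2_sum is a hypothetical stand-in for Work2 that just sums the loop indices, and the clause is only a request, so the runtime may still grant fewer threads if nesting is disabled or a thread limit applies.

```cpp
#include <cassert>

// Sums 0 .. n-1 with a team whose size is requested via num_threads.
// Without -fopenmp the pragma is ignored and the loop runs serially.
long long work2_sum(int n, int threads)
{
    assert(threads > 0);
    long long sum = 0;
    #pragma omp parallel for num_threads(threads) reduction(+ : sum) schedule(static)
    for (int i = 0; i < n; i++)
    {
        sum += i;  // stand-in for one work2 iteration
    }
    return sum;
}
```

The clause affects only the region it is attached to, which keeps the setting next to the loop it is meant for; a call such as work2_sum(100, 3) asks for three threads for this loop only.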