C++ OpenMP: nested for loops become faster with a parallel region before the outer loop. Why?


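I am currently implementing a dynamic programming algorithm to solve the knapsack problem. My code therefore has two for loops, an outer one and an inner one.

From a logical point of view, I can parallelize the inner for loop, because the calculations there are independent of each other. The outer for loop cannot be parallelized because of dependencies.

So this was my first approach: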

for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        #pragma omp parallel for if(weightColumns > THRESHOLD)
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
}
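
For readers who want to run the first approach, here is a minimal self-contained sketch. The `Item` struct, the function name `knapsackRegionPerRow` and the table layout (one extra zero row and zero column) are assumptions made for illustration; the question only shows the `weight` and `worth` fields. The `if(weightColumns > THRESHOLD)` clause is left out, and item weights are assumed to be positive.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical item type; only .weight and .worth appear in the question.
struct Item {
    int weight;
    int worth;
};

// 0/1 knapsack, first approach: a parallel region is opened for every row.
// Returns the best achievable worth for the given capacity.
int knapsackRegionPerRow(const std::vector<Item>& items, int capacity) {
    assert(capacity >= 0);
    const int itemRows = static_cast<int>(items.size()) + 1;
    const int weightColumns = capacity + 1;
    // Row 0 (no items) and column 0 (no capacity) stay zero.
    std::vector<std::vector<int>> table(itemRows, std::vector<int>(weightColumns, 0));

    for (int i = 1; i < itemRows; i++) {
        const int itemWeight = items[i - 1].weight;
        const int itemWorth = items[i - 1].worth;

        // Row i only reads row i-1, so the columns are independent of each other.
        #pragma omp parallel for
        for (int c = 1; c < weightColumns; c++) {
            int best = table[i - 1][c];  // worth of not using the item
            if (c >= itemWeight) {
                best = std::max(best, itemWorth + table[i - 1][c - itemWeight]);
            }
            table[i][c] = best;
        }
    }
    return table[itemRows - 1][weightColumns - 1];
}
```

Compiled without OpenMP support the pragma is ignored and the function simply runs serially, which makes it easy to check the result before timing anything.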



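The code works fine and the algorithm solves the problems correctly.
Then I thought about optimizing it, because I was not sure how OpenMP's thread management works. I wanted to prevent unnecessary thread initialization on every iteration, so I put an outer parallel block around the outer loop.
Second approach: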
#pragma omp parallel if(weightColumns > THRESHOLD)
{
    for(int i=1; i < itemRows; i++){
        int itemsIndex = i-1;
        int itemWeight = integerItems[itemsIndex].weight;
        int itemWorth = integerItems[itemsIndex].worth;

        #pragma omp for
        for(int c=1; c < weightColumns; c++){
            if(c < itemWeight){
                table[i][c] = table[i-1][c];
            }else{
                int worthOfNotUsingItem = table[i-1][c];
                int worthOfUsingItem = itemWorth + table[i-1][c-itemWeight];
                table[i][c] = worthOfNotUsingItem < worthOfUsingItem ? worthOfUsingItem : worthOfNotUsingItem;
            }
        }
     }
}
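
The second approach can be sketched in the same self-contained form. As before, `Item`, the function name `knapsackSingleRegion` and the table layout are assumptions for illustration, and the `if` clause is omitted. The point to notice is that the work every thread repeats in the outer loop is only a couple of loads per row, while the implicit barrier at the end of the `omp for` keeps the rows in order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical item type; only .weight and .worth appear in the question.
struct Item {
    int weight;
    int worth;
};

// 0/1 knapsack, second approach: one parallel region around the whole outer loop.
int knapsackSingleRegion(const std::vector<Item>& items, int capacity) {
    assert(capacity >= 0);
    const int itemRows = static_cast<int>(items.size()) + 1;
    const int weightColumns = capacity + 1;
    std::vector<std::vector<int>> table(itemRows, std::vector<int>(weightColumns, 0));

    #pragma omp parallel
    {
        // Every thread runs this outer loop itself. i, itemWeight and itemWorth
        // are declared inside the region, so each thread has its own copy.
        for (int i = 1; i < itemRows; i++) {
            const int itemWeight = items[i - 1].weight;
            const int itemWorth = items[i - 1].worth;

            // The columns of row i are divided among the threads of the team.
            #pragma omp for
            for (int c = 1; c < weightColumns; c++) {
                int best = table[i - 1][c];
                if (c >= itemWeight) {
                    best = std::max(best, itemWorth + table[i - 1][c - itemWeight]);
                }
                table[i][c] = best;
            }
            // Implicit barrier: row i is complete before any thread starts row i+1.
        }
    }
    return table[itemRows - 1][weightColumns - 1];
}
```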


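This has an unwanted side effect: everything inside the parallel block is now executed n times, where n is the number of available cores. I already tried the pragmas single and critical to force the outer for loop to be executed by a single thread, but then I cannot compute the inner loop with multiple threads unless I open a new parallel block (which would bring no speed-up). But never mind, because the good thing is that this does not affect the result. The problems are still solved correctly.
The strange thing is: the second approach is faster than the first.
How can that be? I mean, although the outer for loop is executed n times (in parallel) and the inner for loop is distributed n times among the n cores, it is faster than the first approach, which executes the outer loop only once and distributes the inner for loop's workload evenly.
At first I thought, "well, yes, that's probably because of thread management", but then I read that OpenMP pools the instantiated threads, which contradicts my assumption. Then I disabled compiler optimization (compiler flag -O0) to check whether it had anything to do with that. But it did not affect the measurements.
Can any of you shed some light on this?
Measured times for solving a knapsack problem with 7500 items and a maximum capacity of 45000 (creating a 7500x45000 matrix, far above the THRESHOLD variable used in the code):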

Approach 1: ~0.88 s
Approach 2: ~0.52 s
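
How these timings were taken is not stated. One way to take comparable wall-clock measurements is sketched below; the helper names `secondsTaken` and `bestOfRuns` are hypothetical, and the use of `std::chrono::steady_clock` with the minimum over several runs is an assumption, not the method used in the question.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <utility>

// Runs `work` once and returns the elapsed wall-clock time in seconds.
template <typename Work>
double secondsTaken(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Runs `work` several times and returns the fastest run, which reduces the
// influence of one-off costs such as the first creation of the thread team.
template <typename Work>
double bestOfRuns(Work&& work, int runs) {
    assert(runs > 0);
    double best = secondsTaken(work);
    for (int r = 1; r < runs; r++) {
        best = std::min(best, secondsTaken(work));
    }
    return best;
}
```

A call such as `bestOfRuns([&] { solve(); }, 5)`, with `solve` standing in for either approach, then gives one number per variant to compare.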

Thanks in advance,
菲尼勒
EDIT:
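Measurements for a more complex problem:
Added 2500 items to the problem (from 7500 to 10000); more complex problems cannot be handled at the moment for memory reasons.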

Approach 1: ~1.19 s
Approach 2: ~0.71 s

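EDIT2:
I was wrong about compiler optimization. It does not affect the measurements; at least I cannot reproduce the difference I measured before. I edited the question text accordingly.
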
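I think the reason is simply that you put #pragma omp parallel at the outer scope level (second version), so there is less overhead for invoking the threads.
In other words, in the first version thread creation is invoked itemRows times, once per iteration of the first loop, whereas in the second version it is invoked only once.
I do not know why, though.
I tried to reproduce this with a simple example, using 4 threads with HT enabled: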
#include <iostream>
#include <vector>
#include <algorithm>
#include <omp.h>

int main()
{
    std::vector<double> v(10000);
    std::generate(v.begin(),  v.end(), []() { static double n{0.0}; return n ++;} );

    double start = omp_get_wtime();

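    // Version 2: one parallel region around the outer loop; every thread runs
    // the outer loop and the inner loop is divided up by "omp for".
    #pragma omp parallel // version 2
    for (auto& el :  v) 
    {
        double t = el - 1.0;
        // Version 1: disable the outer "omp parallel" and enable this one,
        // so that a region is opened on every outer iteration.
        // #pragma omp parallel // version 1
        #pragma omp for
        for (size_t i = 0; i < v.size(); i ++)
        {
            // Note: el is shared, so these updates race between threads;
            // the example is only meant to compare timings.
            el += v[i];
            el-= t;
        }
    }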
    double end = omp_get_wtime();

    std::cout << "   wall time : " << end - start << std::endl;
    // for (const auto& el :  v) { std::cout << el << ";"; }

}

Measured wall times for the two versions:
wall time : 0.512144
wall time : 0.333664

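Let's first consider what your code is doing. Essentially, your code transforms a matrix (2D array) in which the values of a row depend on the previous row, while the values of the columns are independent of the other columns. Let me pick a simpler example: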
for(int i=1; i<n; i++) {
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}

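Parallelizing only the inner loop, as in your first approach, essentially processes the columns of a single row in parallel but handles the rows sequentially. The values of i are only run by the master thread.
Another way to process the columns in parallel but each row sequentially is:
Method 3: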
#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}
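
To check that the placement of the parallel region does not change the result, the simple example can be wrapped into functions and compared against the serial loop. This is only a sketch: the function names and the use of `std::vector<double>` for `a` are assumptions, since the answer does not show how `a` is declared.

```cpp
#include <cassert>
#include <vector>

// Serial reference: every row accumulates the row above it.
void accumulateSerial(std::vector<double>& a, int n) {
    assert(static_cast<int>(a.size()) == n * n);
    for (int i = 1; i < n; i++)
        for (int j = 0; j < n; j++)
            a[i * n + j] += a[(i - 1) * n + j];
}

// Parallel region opened once per row; only the encountering thread runs the i loop.
void accumulateRegionPerRow(std::vector<double>& a, int n) {
    assert(static_cast<int>(a.size()) == n * n);
    for (int i = 1; i < n; i++) {
        #pragma omp parallel for
        for (int j = 0; j < n; j++)
            a[i * n + j] += a[(i - 1) * n + j];
    }
}

// Method 3: one region around both loops; every thread runs the i loop, and
// the implicit barrier after each "omp for" keeps the rows in order.
void accumulateSingleRegion(std::vector<double>& a, int n) {
    assert(static_cast<int>(a.size()) == n * n);
    #pragma omp parallel
    for (int i = 1; i < n; i++) {
        #pragma omp for
        for (int j = 0; j < n; j++)
            a[i * n + j] += a[(i - 1) * n + j];
    }
}
```

Each element is computed by the same sequence of additions in all three variants, so the results can be compared for exact equality.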

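In my tests the nowait clause did not make much difference. That is probably because the load is even (which is why static scheduling is ideal in this case). If the load were less even, nowait would probably make more of a difference.
Here are the times in seconds for n=3000 on my four-core IVB system with GCC 4.9.2: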
method 1: 3.00
method 2: 0.26 
method 3: 0.21
method 4: 0.21

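This test is probably memory-bandwidth bound, so I could have chosen a better case that does more computation.
The variant with the nowait clause: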
#pragma omp parallel
for(int i=1; i<n; i++) {
    #pragma omp for nowait
    for(int j=0; j<n; j++) {
        a[i*n+j] += a[(i-1)*n+j];
    }
}
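
Dropping the barrier with nowait is only safe here because each thread reads just the entries of the previous row that it wrote itself, which requires the same columns to go to the same thread in every row. OpenMP guarantees that mapping for loops with a static schedule, the same iteration count and the same chunk size inside one parallel region. The default schedule is implementation-defined, so the sketch below spells out `schedule(static)`; that clause and the function name `accumulateNoWait` are additions for illustration and are not part of the snippet above.

```cpp
#include <cassert>
#include <vector>

// nowait variant: threads may run ahead to the next row without waiting.
// Correct only because column j of every row is handled by the same thread.
void accumulateNoWait(std::vector<double>& a, int n) {
    assert(static_cast<int>(a.size()) == n * n);
    #pragma omp parallel
    for (int i = 1; i < n; i++) {
        // Static schedule: identical column-to-thread assignment in every row.
        #pragma omp for schedule(static) nowait
        for (int j = 0; j < n; j++)
            a[i * n + j] += a[(i - 1) * n + j];
    }
}
```

With a dynamic or guided schedule the same code would contain a race, because a thread could read an entry of row i-1 that another thread has not written yet.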
