C++: Initializing std::vector outside of main() causes performance degradation (multithreading)


c++ multithreading performance c++11 vector


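I'm writing a path tracer as a programming exercise. Yesterday I finally decided to implement multithreading, and it worked well. However, once I wrapped the test code I had written in main() into a separate renderer class, I noticed a significant and consistent performance drop. In short, filling a std::vector anywhere outside of main() seems to make the threads that use its elements perform worse. I managed to isolate and reproduce the issue with simplified code, but unfortunately I still don't know why it happens or how to fix it.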

The performance drop is pronounced and consistent:

  97 samples - time = 28.154226s, per sample = 0.290250s, per sample/th = 1.741498
  99 samples - time = 28.360723s, per sample = 0.286472s, per sample/th = 1.718832
 100 samples - time = 29.335468s, per sample = 0.293355s, per sample/th = 1.760128

vs.

  98 samples - time = 30.197734s, per sample = 0.308140s, per sample/th = 1.848841
  99 samples - time = 30.534240s, per sample = 0.308427s, per sample/th = 1.850560
 100 samples - time = 30.786519s, per sample = 0.307865s, per sample/th = 1.847191

The code I originally posted with this question is available in the question's edit history.

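I created a struct foo that is supposed to mimic the behavior of my renderer class and is responsible for initializing the path tracing contexts in its constructor. Interestingly, when I remove the body of foo's constructor and instead initialize contexts directly from main():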

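std::vector<path_tracer> contexts; // can live on the stack or on the heap, it doesn't matter
foo F(cam, scene, bvh, width, height, render_threads, contexts); // no longer fills `contexts`
contexts.reserve(render_threads);
for (int i = 0; i < render_threads; i++)
    [...]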


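the performance goes back to normal. However, if I wrap these three lines into a separate function and call it from here, the performance is worse again. The only pattern I can see is that filling the contexts vector outside of main() causes the problem.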
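
I initially thought this was an alignment/cache issue, so I tried aligning the path_tracer objects with Boost's aligned_allocator and TBB's cache_aligned_allocator, but to no avail. It turns out that the problem persists even when only one thread is running.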
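
I suspect this must be some kind of aggressive compiler optimization (I'm using -O3), although that is only a guess. Do you know of any possible causes of this behavior, and what can be done to avoid it?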

This happens with both gcc 10.1.0 and clang 10.0.0. Currently I'm only using -O3.

I managed to reproduce a similar issue in this standalone example:
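#include <iostream>
#include <random>
#include <thread>
#include <vector>

struct foo
{
    std::mt19937 rng;
    std::uniform_real_distribution<float> dist;
    std::vector<float> buf;
    int cnt = 0;

    foo(int seed, int n) :
        rng(seed),
        dist(0, 1),
        buf(n, 0)
    {
    }

    void do_stuff()
    {
        // Do whatever
        for (auto &f : buf)
            f = (f + 1) * dist(rng);
        cnt++;
    }
};

int main()
{
    int N = 50000000;
    int thread_count = 6;

    struct bar
    {
        std::vector<std::thread> threads;
        std::vector<foo> &foos;
        bool active = true;

        bar(std::vector<foo> &f, int thread_count, int n) :
            foos(f)
        {
            /*
            foos.reserve(thread_count);
            for (int i = 0; i < thread_count; i++)
            [...]
                while (this->active)
                    f.do_stuff();
            };

            threads.reserve(thread_count);
            for (int i = 0; i < thread_count; i++)
            [...]

There is a race condition on foo::buf: one thread stores into it while another thread reads it. This is undefined behavior, but on the x86-64 platform it is harmless in this particular code.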

I cannot reproduce your observations on an Intel i9-9900KS; both versions print the same per-sample statistics.

Compiled with gcc-8.4: g++ -o release/gcc/test.o -c -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG test.cc

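With int N = 50000000; each thread operates on its own float[N] array, which occupies 200 MB. Such a data set does not fit in the CPU caches, and the program incurs a lot of data cache misses because it has to fetch the data from memory: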
$ perf stat -ddd ./release/gcc/test
[...]
      71474.813087      task-clock (msec)         #    6.860 CPUs utilized          
                66      context-switches          #    0.001 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           341,942      page-faults               #    0.005 M/sec                  
   357,027,759,875      cycles                    #    4.995 GHz                      (30.76%)
   991,950,515,582      instructions              #    2.78  insn per cycle           (38.43%)
   105,609,126,987      branches                  # 1477.571 M/sec                    (38.40%)
       155,426,137      branch-misses             #    0.15% of all branches          (38.39%)
   150,832,846,580      L1-dcache-loads           # 2110.294 M/sec                    (38.41%)
     4,945,287,289      L1-dcache-load-misses     #    3.28% of all L1-dcache hits    (38.44%)
     1,787,635,257      LLC-loads                 #   25.011 M/sec                    (30.79%)
     1,103,347,596      LLC-load-misses           #   61.72% of all LL-cache hits     (30.81%)
   <not supported>      L1-icache-loads                                             
         7,457,756      L1-icache-load-misses                                         (30.80%)
   150,527,469,899      dTLB-loads                # 2106.021 M/sec                    (30.80%)
        54,966,843      dTLB-load-misses          #    0.04% of all dTLB cache hits   (30.80%)
            26,956      iTLB-loads                #    0.377 K/sec                    (30.80%)
           415,128      iTLB-load-misses          # 1540.02% of all iTLB cache hits   (30.79%)
   <not supported>      L1-dcache-prefetches                                        
   <not supported>      L1-dcache-prefetch-misses                                   

      10.419122076 seconds time elapsed

