Increasing the number of CPUs decreases performance, with constant CPU load and no communication (C++ / C / Fortran / MPI)

I have run into an interesting phenomenon that I cannot explain. I have not found an answer online, since most posts deal with weak scaling and therefore communication overhead. Here is a small piece of code to illustrate the problem. It was tested in different languages with similar results, hence the multiple tags:
#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main() {
    MPI_Init(NULL, NULL);

    int wsize;
    MPI_Comm_size(MPI_COMM_WORLD, &wsize);
    int wrank;
    MPI_Comm_rank(MPI_COMM_WORLD, &wrank);

    clock_t t;
    MPI_Barrier(MPI_COMM_WORLD);
    t = clock();

    int imax = 10000000;
    int jmax = 1000;
    for (int i = 0; i < imax; i++) {
        for (int j = 0; j < jmax; j++) {
            // nothing
        }
    }

    t = clock() - t;
    printf(" proc %d took %f seconds.\n", wrank, (float)t / CLOCKS_PER_SEC);
    MPI_Finalize();
    return 0;
}
Running it with 1 core gives

proc 0 took 22.262777 seconds.
but running it with 20 cores gives
proc 18 took 24.440767 seconds.
proc 0 took 24.454365 seconds.
proc 4 took 24.461191 seconds.
proc 15 took 24.467632 seconds.
proc 14 took 24.469728 seconds.
proc 7 took 24.469809 seconds.
proc 5 took 24.461639 seconds.
proc 11 took 24.484224 seconds.
proc 9 took 24.491638 seconds.
proc 2 took 24.484953 seconds.
proc 17 took 24.490984 seconds.
proc 16 took 24.502146 seconds.
proc 3 took 24.513380 seconds.
proc 1 took 24.541555 seconds.
proc 8 took 24.539808 seconds.
proc 13 took 24.540005 seconds.
proc 12 took 24.556068 seconds.
proc 10 took 24.528328 seconds.
proc 19 took 24.585297 seconds.
proc 6 took 24.611254 seconds.
with values in between for other numbers of CPUs.

htop also shows increased RAM consumption (VIRT is about 100M with 1 core vs. about 300M with 20 cores), although that may just be related to the size of the MPI communicator.

Finally, it is definitely related to the problem size (so it is not a communication overhead causing a constant delay regardless of loop size). Indeed, decreasing imax to 10000 makes the walltimes comparable.
1 core:
proc 0 took 0.028439 seconds.
20 cores:
proc 1 took 0.027880 seconds.
proc 12 took 0.027880 seconds.
proc 8 took 0.028024 seconds.
proc 16 took 0.028135 seconds.
proc 17 took 0.028094 seconds.
proc 19 took 0.028098 seconds.
proc 7 took 0.028265 seconds.
proc 9 took 0.028051 seconds.
proc 13 took 0.028259 seconds.
proc 18 took 0.028274 seconds.
proc 5 took 0.028087 seconds.
proc 6 took 0.028032 seconds.
proc 14 took 0.028385 seconds.
proc 15 took 0.028429 seconds.
proc 0 took 0.028379 seconds.
proc 2 took 0.028367 seconds.
proc 3 took 0.028291 seconds.
proc 4 took 0.028419 seconds.
proc 10 took 0.028419 seconds.
proc 11 took 0.028404 seconds.
This has been tried on several machines with similar results. Maybe we are missing something very simple. Thanks for any help.

Answer: the processor's turbo frequency is limited by temperature.
Modern processors are limited by their Thermal Design Power (TDP). While the processor is cold, a single core may speed up to its turbo multiplier. When the die is hot, or when several cores are non-idle, the cores slow down to the guaranteed base speed. The difference between base and turbo speed is often around 400 MHz. AVX or FMA3 workloads may even run below base speed. Scheduling many cores/threads can also take considerable time: each thread needs to do far more work than the scheduler spends scheduling it. And if the code is memory-bandwidth limited, that hurts performance as well.

Comments:

- C++ and C are two very different languages. Unless there is a good reason, you should not tag both! By the way, what does Fortran have to do with this?
- @muXXmit2X I added several tags because this was tested in different languages with similar results. I should have mentioned that.
- Have you considered the effect on the L3 cache? With more processors competing for a limited cache, there will be more memory reads going out to RAM. Are you binding the MPI tasks to cores? Are you running more MPI tasks than you have cores?
- I think it is unlikely that thermal throttling kicking in causes the slowdown; the processor only seems to run for about 25 seconds. One way to check would be to disable turbo mode and run the test again.