C: Strange multithreading performance
I am trying to make sense of some rather disappointing performance results we are getting in our HPC application. I wrote the following benchmark in Visual Studio 2010; it distills the essence of our application (many independent operations with high arithmetic intensity):
Machine 2: dual Xeon E5-2690 @ 2.90 GHz, 16 physical cores, 32 logical cores, Sandy Bridge architecture
Westmere machine:
Starting 1 threads... Elapsed time: 11.575 seconds
Starting 2 threads... Elapsed time: 11.575 seconds
Starting 3 threads... Elapsed time: 11.591 seconds
Starting 4 threads... Elapsed time: 11.684 seconds
Starting 5 threads... Elapsed time: 11.825 seconds
Starting 6 threads... Elapsed time: 12.324 seconds
Starting 7 threads... Elapsed time: 14.992 seconds
Starting 8 threads... Elapsed time: 15.803 seconds
Starting 9 threads... Elapsed time: 16.520 seconds
Starting 10 threads... Elapsed time: 17.098 seconds
Starting 11 threads... Elapsed time: 17.472 seconds
Starting 12 threads... Elapsed time: 17.519 seconds
Starting 13 threads... Elapsed time: 17.395 seconds
Starting 14 threads... Elapsed time: 17.176 seconds
Starting 15 threads... Elapsed time: 16.973 seconds
Starting 16 threads... Elapsed time: 17.144 seconds
Starting 17 threads... Elapsed time: 17.129 seconds
Starting 18 threads... Elapsed time: 17.581 seconds
Starting 19 threads... Elapsed time: 17.769 seconds
Starting 20 threads... Elapsed time: 18.440 seconds
Sandy Bridge machine (Machine 2):
Starting 1 threads... Elapsed time: 10.249 seconds
Starting 2 threads... Elapsed time: 10.562 seconds
Starting 3 threads... Elapsed time: 10.998 seconds
Starting 4 threads... Elapsed time: 11.232 seconds
Starting 5 threads... Elapsed time: 11.497 seconds
Starting 6 threads... Elapsed time: 11.653 seconds
Starting 7 threads... Elapsed time: 11.700 seconds
Starting 8 threads... Elapsed time: 11.888 seconds
Starting 9 threads... Elapsed time: 12.246 seconds
Starting 10 threads... Elapsed time: 12.605 seconds
Starting 11 threads... Elapsed time: 13.026 seconds
Starting 12 threads... Elapsed time: 13.041 seconds
Starting 13 threads... Elapsed time: 13.182 seconds
Starting 14 threads... Elapsed time: 12.885 seconds
Starting 15 threads... Elapsed time: 13.416 seconds
Starting 16 threads... Elapsed time: 13.011 seconds
Starting 17 threads... Elapsed time: 12.949 seconds
Starting 18 threads... Elapsed time: 13.011 seconds
Starting 19 threads... Elapsed time: 13.166 seconds
Starting 20 threads... Elapsed time: 13.182 seconds
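Here is what puzzles me:
- Why does the elapsed time on the Westmere machine stay constant up to about 6 cores, then jump suddenly, and then stay roughly constant above 10 threads? Is Windows cramming all threads onto one processor before moving on to the second, so that hyper-threading kicks in nondeterministically once the first processor is full?
- Why does the elapsed time on the Sandy Bridge machine increase more or less linearly with the number of threads up to about 12 threads? Given the number of cores, 12 does not seem like a meaningful number to me.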
void makework(void *jnk) {
    register int i, j;
    // register double tmp = 0;
    __asm {
        fldz  // this holds the result on the stack
    }
    for(j=0; j<10000; j++) {
        __asm {
            fldz  // push i onto the stack: stack = 0, res
        }
        for(i=0; i<1000000; i++) {
            // tmp += (double)i * (double)i;
            __asm {
                fld st(0)           // stack: i, i, res
                fld st(0)           // stack: i, i, i, res
                fmul                // stack: i*i, i, res
                faddp st(2), st(0)  // stack: i, res+i*i
                fld1                // stack: 1, i, res+i*i
                fadd                // stack: i+1, res+i*i
            }
        }
        __asm {
            fstp st(0)  // pop i off the stack leaving only res in st(0)
        }
    }
    __asm {
        mov eax, dword ptr [jnk]
        fstp qword ptr [eax]
    }
    // *((double *)jnk) = tmp;
    _endthread();
}
The results on Machine 1 with the version above:
Starting 1 threads... Elapsed time: 12.589 seconds
Starting 2 threads... Elapsed time: 12.574 seconds
Starting 3 threads... Elapsed time: 12.652 seconds
Starting 4 threads... Elapsed time: 12.682 seconds
Starting 5 threads... Elapsed time: 13.011 seconds
Starting 6 threads... Elapsed time: 13.790 seconds
Starting 7 threads... Elapsed time: 16.411 seconds
Starting 8 threads... Elapsed time: 18.003 seconds
Starting 9 threads... Elapsed time: 19.220 seconds
Starting 10 threads... Elapsed time: 20.124 seconds
Starting 11 threads... Elapsed time: 20.764 seconds
Starting 12 threads... Elapsed time: 20.935 seconds
Starting 13 threads... Elapsed time: 20.748 seconds
Starting 14 threads... Elapsed time: 20.717 seconds
Starting 15 threads... Elapsed time: 20.608 seconds
Starting 16 threads... Elapsed time: 20.685 seconds
Starting 17 threads... Elapsed time: 21.107 seconds
Starting 18 threads... Elapsed time: 21.451 seconds
Starting 19 threads... Elapsed time: 22.043 seconds
Starting 20 threads... Elapsed time: 22.745 seconds
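Because every thread does the same fixed amount of work, comparing runs comes down to a couple of ratios. The sketch below is only an illustration of that arithmetic; the helper names are hypothetical, and it assumes that the 11.575 s single-thread figure from the first listing belongs to the same machine as the 12.589 s figure here.

```c
#include <stdio.h>

/* Percentage by which a run taking t_new seconds is slower than one
   taking t_old seconds. */
double slowdown_pct(double t_old, double t_new) {
    return (t_new / t_old - 1.0) * 100.0;
}

/* Aggregate throughput relative to a single thread. Each of the n threads
   does the same fixed amount of work, so n threads finishing in t_n
   seconds deliver n * t_1 / t_n times the single-thread throughput.
   Perfect scaling would give exactly n. */
double relative_throughput(int n, double t_1, double t_n) {
    return n * t_1 / t_n;
}

/* Prints both ratios for timings taken from the listings in this post. */
void print_summary(void) {
    printf("1 thread, asm version vs. first listing: %.1f%% slower\n",
           slowdown_pct(11.575, 12.589));
    printf("12 threads vs. 1 thread (asm version): %.2fx throughput\n",
           relative_throughput(12, 12.589, 20.935));
}
```

Calling print_summary() from main() prints both figures. A flat elapsed time as threads are added corresponds to relative_throughput(n, ...) staying close to n.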
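So a single thread is about 9% slower (perhaps the difference between inc eax and fld1 plus faddp?), and once all physical cores are filled it is almost twice as slow (which is what you would expect from hyper-threading). However, the puzzling part remains: the performance starts to drop at only 6 threads...
On my laptop, which has 2 physical cores and 4 logical cores, I get: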
Starting 1 threads... Elapsed time: 11.638 seconds
Starting 2 threads... Elapsed time: 12.418 seconds
Starting 3 threads... Elapsed time: 13.556 seconds
Starting 4 threads... Elapsed time: 14.929 seconds
Starting 5 threads... Elapsed time: 20.811 seconds
Starting 6 threads... Elapsed time: 22.776 seconds
Starting 7 threads... Elapsed time: 27.160 seconds
Starting 8 threads... Elapsed time: 30.249 seconds
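This shows that performance degrades as soon as we have more than 1 thread.
I suspect the reason is that the function makework() is performing memory accesses. In Visual Studio 2010 you can see this by setting a breakpoint on the first line of _tmain(). When the breakpoint is hit, press Ctrl-Alt-D to open the Disassembly window. Anywhere you see a register name in brackets (e.g. [esp]) is a memory access. The bandwidth of the CPU's level 1 cache is being saturated. You can test this theory with a modified makework().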
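(Possible explanation) Have you checked for background activity on these machines? It may be that the OS cannot dedicate all of its cores entirely to you. On your Machine 1, the significant increase begins once you occupy more than half of the cores. Your threads may be competing for resources with something else.
You may also want to check whether there are restrictions or domain policies on your machine/account that prevent you from taking all available resources.
OK, now that we have ruled out the memory saturation theory (although: x87? Ouch, don't expect much performance there. Try switching to SSE/AVX if you can live with what they offer). The core scaling should still make sense, so let's look at the CPU models you are using.
Can you confirm that these are the correct models?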
Intel® Xeon® Processor X5690 (12M Cache, 3.46 GHz, 6.40 GT/s Intel® QPI)
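If so, then the first one does indeed have 6 physical cores (12 logical cores) per socket, and the second has 8 physical cores (16 logical cores) per socket. Come to think of it, I don't think you could get a higher core count on a single socket in these processor generations, so this makes sense, and it fits your numbers perfectly.
Edit:
On a multi-socket system, the OS may prefer a single socket as long as logical cores are still available there. This may depend on the exact version, but for Windows Server 2008 there is an interesting note: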
void makework(void *jnk) {
    double tmp = 0;
    volatile double *p;
    int i;
    int j;
    p=(double*)jnk;
    for(j=0; j<100000000; j++) {
        for(i=0; i<100; i++) {
            tmp = tmp+(double)i*(double)i;
        }
        *p=tmp;
    }
    *p = tmp;
    _endthread();
}
Starting 1 threads... Elapsed time: 11.684 seconds
Starting 2 threads... Elapsed time: 13.760 seconds
Starting 3 threads... Elapsed time: 14.445 seconds
Starting 4 threads... Elapsed time: 17.519 seconds
Starting 5 threads... Elapsed time: 23.369 seconds
Starting 6 threads... Elapsed time: 25.491 seconds
Starting 7 threads... Elapsed time: 30.155 seconds
Starting 8 threads... Elapsed time: 34.460 seconds
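One of the answers suggests moving from x87 to SSE/AVX. As a rough illustration only (this is not code from the thread, and it has not been timed on the machines discussed), the same i*i accumulation can be written with SSE2 intrinsics from <emmintrin.h>, processing two values of i per iteration. The function name sum_squares and the HAVE_SSE2 macro are hypothetical; the scalar loop doubles as a fallback where SSE2 is not available.

```c
#include <stddef.h>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2 1
#endif

/* Sum of i*i for i in [0, n): the same arithmetic as the inner loop of
   the C version of makework(). */
double sum_squares(int n) {
    int i = 0;
    double total = 0.0;
#ifdef HAVE_SSE2
    __m128d acc = _mm_setzero_pd();          /* two partial sums            */
    __m128d idx = _mm_set_pd(1.0, 0.0);      /* low lane = i, high = i + 1  */
    const __m128d step = _mm_set1_pd(2.0);   /* advance both lanes by 2     */
    double lanes[2];
    for (; i + 1 < n; i += 2) {
        acc = _mm_add_pd(acc, _mm_mul_pd(idx, idx));
        idx = _mm_add_pd(idx, step);
    }
    _mm_storeu_pd(lanes, acc);
    total = lanes[0] + lanes[1];
#endif
    /* Scalar tail for an odd n, or the whole loop without SSE2. */
    for (; i < n; i++)
        total += (double)i * (double)i;
    return total;
}
```

Note that 64-bit compilers already use scalar SSE2 instructions for ordinary double arithmetic, so the plain C loop avoids x87 on such targets without any intrinsics. Whether the packed version helps here would have to be measured.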
Intel® Xeon® Processor X5690 (12M Cache, 3.46 GHz, 6.40 GT/s Intel® QPI)
Intel® Xeon® Processor E5-2690 (20M Cache, 2.90 GHz, 8.00 GT/s Intel® QPI)
When the OS boots it starts with socket 1 and enumerates all logical processors:
on socket 1 it enumerates logical processors 1-20
on socket 2 it enumerates logical processors 21-40
on socket 3 it enumerates logical processors 41-60
on socket 4 it would see 61-64
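The enumeration in the quote follows a simple rule, sketched below. The two constants are read off the quoted example (20 logical processors per socket, numbering stopping at 64) and are assumptions for illustration, not properties of the machines in the question. The function names are hypothetical.

```c
#include <stdio.h>

/* Figures taken from the quoted example; assumptions for illustration. */
#define LP_PER_SOCKET 20
#define LP_LIMIT      64

/* First logical processor number (1-based) enumerated on a socket. */
int first_lp(int socket) {
    return (socket - 1) * LP_PER_SOCKET + 1;
}

/* Last logical processor number enumerated on a socket, capped at the
   limit, so the last socket may be only partially visible. */
int last_lp(int socket) {
    int last = socket * LP_PER_SOCKET;
    return last > LP_LIMIT ? LP_LIMIT : last;
}

/* Prints one line per socket in the style of the quote. */
void print_enumeration(int sockets) {
    int s;
    for (s = 1; s <= sockets; s++)
        printf("socket %d: logical processors %d-%d\n",
               s, first_lp(s), last_lp(s));
}
```

With these constants, print_enumeration(4) lists the same ranges as the quote. The point for the question is that consecutive logical processor numbers fill one socket before the next one starts.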
void spawnthreads(int num) {
    ULONG_PTR masks[] = { // for my system; YMMV
        0x1, 0x4, 0x10, 0x40, 0x100, 0x400, 0x1000, 0x4000, 0x10000, 0x40000,
        0x100000, 0x400000, 0x2, 0x8, 0x20, 0x80, 0x200, 0x800, 0x2000, 0x8000};
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        // Note: _beginthread closes the thread handle itself when the thread
        // exits; _beginthreadex is the safer choice when the handle is waited on.
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
        // Pin thread i to a single logical processor (num must not exceed 20 masks).
        SetThreadAffinityMask(hThreads[i], masks[i]);
    }
    DWORD start = GetTickCount(); // milliseconds since system start
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    DWORD end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f,%f\n", num, (double)(end-start)/1000.0, junk[0]);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);
}
Starting 1 threads... Elapsed time: 12.558 seconds
Starting 2 threads... Elapsed time: 12.558 seconds
Starting 3 threads... Elapsed time: 12.589 seconds
Starting 4 threads... Elapsed time: 12.652 seconds
Starting 5 threads... Elapsed time: 12.621 seconds
Starting 6 threads... Elapsed time: 12.777 seconds
Starting 7 threads... Elapsed time: 12.636 seconds
Starting 8 threads... Elapsed time: 12.886 seconds
Starting 9 threads... Elapsed time: 13.057 seconds
Starting 10 threads... Elapsed time: 12.714 seconds
Starting 11 threads... Elapsed time: 12.777 seconds
Starting 12 threads... Elapsed time: 12.668 seconds
Starting 13 threads... Elapsed time: 26.489 seconds
Starting 14 threads... Elapsed time: 26.505 seconds
Starting 15 threads... Elapsed time: 26.505 seconds
Starting 16 threads... Elapsed time: 26.489 seconds
Starting 17 threads... Elapsed time: 26.489 seconds
Starting 18 threads... Elapsed time: 26.676 seconds
Starting 19 threads... Elapsed time: 26.770 seconds
Starting 20 threads... Elapsed time: 26.489 seconds
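To see what the affinity masks in spawnthreads() actually select, it helps to decode each single-bit mask into a logical processor index. The sketch below does only that; the helper names logical_cpu and print_mask_order are hypothetical, and the masks are copied from the listing, which marks them as specific to the poster's system.

```c
#include <stdio.h>

/* Affinity masks copied from spawnthreads(); system-specific. */
static const unsigned long masks[] = {
    0x1, 0x4, 0x10, 0x40, 0x100, 0x400, 0x1000, 0x4000, 0x10000, 0x40000,
    0x100000, 0x400000, 0x2, 0x8, 0x20, 0x80, 0x200, 0x800, 0x2000, 0x8000};

/* Index of the lowest set bit, i.e. the logical processor that a
   single-bit mask selects. Returns -1 for an empty mask. */
int logical_cpu(unsigned long mask) {
    int bit = 0;
    if (mask == 0)
        return -1;
    while ((mask & 1UL) == 0) {
        mask >>= 1;
        bit++;
    }
    return bit;
}

/* Prints which logical processor each thread is pinned to, in start order. */
void print_mask_order(void) {
    size_t n = sizeof(masks) / sizeof(masks[0]);
    size_t i;
    for (i = 0; i < n; i++)
        printf("thread %2u -> logical processor %2d\n",
               (unsigned)(i + 1), logical_cpu(masks[i]));
}
```

The first 12 masks select the even-numbered logical processors 0 through 22, and the remaining 8 select the odd-numbered ones 1 through 15. If logical processors 2k and 2k+1 are the two hyper-threads of one physical core on this machine (an assumption, since the post does not state the topology), then up to 12 threads each get a physical core to themselves, and from the 13th thread on at least one core is shared. That would fit the listing above, where the elapsed time stays near 12.7 seconds up to 12 threads and roughly doubles from 13 threads on.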