C: Strange multithreading performance

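I'm trying to get to the bottom of some rather disappointing performance results we're getting for our HPC application. I wrote the following benchmark in Visual Studio 2010, which distills the essence of our application (lots of independent, high arithmetic intensity operations):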
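For reference, the C statements left as comments in the assembly version further down (`tmp += (double)i * (double)i;` and `*((double *)jnk) = tmp;`) suggest that each thread runs a kernel roughly like the one below. This is only a sketch under that assumption: the name `makework_kernel` and its loop-count parameters are hypothetical, and the thread start-up and timing code is left out.

```c
#include <assert.h>

/* Hypothetical stand-in for the per-thread work: accumulate i*i in double
 * precision over `inner` iterations, repeated `outer` times. The assembly
 * version in this thread uses outer = 10000 and inner = 1000000. */
double makework_kernel(int outer, int inner) {
    assert(outer >= 0 && inner >= 0);
    double tmp = 0.0;
    for (int j = 0; j < outer; j++) {
        for (int i = 0; i < inner; i++) {
            tmp += (double)i * (double)i;   /* no memory traffic needed here */
        }
    }
    return tmp;
}
```

The loop touches no arrays, so in principle every thread can run entirely out of registers; that is why the elapsed time would ideally not depend on the number of threads.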

Machine 1 (the Westmere machine):

Starting 1 threads... Elapsed time: 11.575 seconds
Starting 2 threads... Elapsed time: 11.575 seconds
Starting 3 threads... Elapsed time: 11.591 seconds
Starting 4 threads... Elapsed time: 11.684 seconds
Starting 5 threads... Elapsed time: 11.825 seconds
Starting 6 threads... Elapsed time: 12.324 seconds
Starting 7 threads... Elapsed time: 14.992 seconds
Starting 8 threads... Elapsed time: 15.803 seconds
Starting 9 threads... Elapsed time: 16.520 seconds
Starting 10 threads... Elapsed time: 17.098 seconds
Starting 11 threads... Elapsed time: 17.472 seconds
Starting 12 threads... Elapsed time: 17.519 seconds
Starting 13 threads... Elapsed time: 17.395 seconds
Starting 14 threads... Elapsed time: 17.176 seconds
Starting 15 threads... Elapsed time: 16.973 seconds
Starting 16 threads... Elapsed time: 17.144 seconds
Starting 17 threads... Elapsed time: 17.129 seconds
Starting 18 threads... Elapsed time: 17.581 seconds
Starting 19 threads... Elapsed time: 17.769 seconds
Starting 20 threads... Elapsed time: 18.440 seconds

Machine 2: dual Xeon E5-2690 @ 2.90 GHz, 16 physical cores, 32 logical cores, Sandy Bridge architecture

Starting 1 threads... Elapsed time: 10.249 seconds
Starting 2 threads... Elapsed time: 10.562 seconds
Starting 3 threads... Elapsed time: 10.998 seconds
Starting 4 threads... Elapsed time: 11.232 seconds
Starting 5 threads... Elapsed time: 11.497 seconds
Starting 6 threads... Elapsed time: 11.653 seconds
Starting 7 threads... Elapsed time: 11.700 seconds
Starting 8 threads... Elapsed time: 11.888 seconds
Starting 9 threads... Elapsed time: 12.246 seconds
Starting 10 threads... Elapsed time: 12.605 seconds
Starting 11 threads... Elapsed time: 13.026 seconds
Starting 12 threads... Elapsed time: 13.041 seconds
Starting 13 threads... Elapsed time: 13.182 seconds
Starting 14 threads... Elapsed time: 12.885 seconds
Starting 15 threads... Elapsed time: 13.416 seconds
Starting 16 threads... Elapsed time: 13.011 seconds
Starting 17 threads... Elapsed time: 12.949 seconds
Starting 18 threads... Elapsed time: 13.011 seconds
Starting 19 threads... Elapsed time: 13.166 seconds
Starting 20 threads... Elapsed time: 13.182 seconds
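Here are the aspects I find puzzling:

  • Why does the elapsed time on the Westmere machine stay constant up to about 6 cores, then jump suddenly, and then stay roughly constant above 10 threads? Is Windows stuffing all the threads onto a single processor before moving on to the second one, so that hyperthreading kicks in non-deterministically once one processor is full?

  • Why does the elapsed time on the Sandy Bridge machine increase basically linearly with the number of threads up to about 12 threads? Given the number of cores, twelve doesn't seem like a meaningful number to me.

Any thoughts, and any suggestions for processor counters to measure or ways to improve my benchmark, would be greatly appreciated. Is this an architecture issue or a Windows issue?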
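Because every thread performs the same fixed amount of work, perfect scaling would keep the elapsed time constant as threads are added. One way to put a number on the shortfall is the ratio of the single-thread time to the n-thread time. The helper names below are hypothetical; this is a minimal sketch of the arithmetic, not part of the original benchmark.

```c
#include <assert.h>

/* Ratio of the 1-thread time to the n-thread time.
 * 1.0 means the extra threads cost nothing; lower values mean contention. */
double scaling_efficiency(double t1, double tn) {
    assert(t1 > 0.0 && tn > 0.0);
    return t1 / tn;
}

/* Work completed per unit time with n threads, relative to one thread. */
double aggregate_speedup(int n, double t1, double tn) {
    assert(n >= 1);
    return n * scaling_efficiency(t1, tn);
}
```

Applied to the first timing list (1 thread: 11.575 s, 12 threads: 17.519 s), the efficiency is about 0.66, so 12 threads get through roughly 7.9 times the work of one thread in the same wall-clock time.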

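Edit:

As suggested below, the compiler was doing something odd, so I wrote my own assembly code. It does the same thing as above but keeps all FP operations on the FP stack to avoid any memory accesses: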

void makework(void *jnk) {
    register int i, j;
//  register double tmp = 0;
    __asm {
        fldz  // this holds the result on the stack
    }
    for(j=0; j<10000; j++) {
        __asm {
            fldz // push i onto the stack: stack = 0, res
        }
        for(i=0; i<1000000; i++) {
            // tmp += (double)i * (double)i;
            __asm {
                fld st(0)  // stack: i, i, res
                fld st(0)  // stack: i, i, i, res
                fmul       // stack: i*i, i, res
                faddp st(2), st(0) // stack: i, res+i*i
                fld1       // stack: 1, i, res+i*i
                fadd      // stack: i+1, res+i*i
            }
        }
        __asm {
            fstp st(0)   // pop i off the stack leaving only res in st(0)
        }
    }
    __asm {
        mov eax, dword ptr [jnk]
        fstp qword ptr [eax]
    }
//  *((double *)jnk) = tmp;
    _endthread();
}
The results on machine 1 above are as follows:

Starting 1 threads... Elapsed time: 12.589 seconds
Starting 2 threads... Elapsed time: 12.574 seconds
Starting 3 threads... Elapsed time: 12.652 seconds
Starting 4 threads... Elapsed time: 12.682 seconds
Starting 5 threads... Elapsed time: 13.011 seconds
Starting 6 threads... Elapsed time: 13.790 seconds
Starting 7 threads... Elapsed time: 16.411 seconds
Starting 8 threads... Elapsed time: 18.003 seconds
Starting 9 threads... Elapsed time: 19.220 seconds
Starting 10 threads... Elapsed time: 20.124 seconds
Starting 11 threads... Elapsed time: 20.764 seconds
Starting 12 threads... Elapsed time: 20.935 seconds
Starting 13 threads... Elapsed time: 20.748 seconds
Starting 14 threads... Elapsed time: 20.717 seconds
Starting 15 threads... Elapsed time: 20.608 seconds
Starting 16 threads... Elapsed time: 20.685 seconds
Starting 17 threads... Elapsed time: 21.107 seconds
Starting 18 threads... Elapsed time: 21.451 seconds
Starting 19 threads... Elapsed time: 22.043 seconds
Starting 20 threads... Elapsed time: 22.745 seconds

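So it is about 9% slower with one thread (the difference between inc eax on the one hand and fld1 plus faddp on the other, perhaps?), and when all the physical cores are filled it is almost twice as slow (which would be expected from hyperthreading). But the puzzling aspect, performance degrading from only 6 threads onwards, still remains...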

On my laptop, which has 2 physical cores and 4 logical cores, I get:

Starting 1 threads... Elapsed time: 11.638 seconds
Starting 2 threads... Elapsed time: 12.418 seconds
Starting 3 threads... Elapsed time: 13.556 seconds
Starting 4 threads... Elapsed time: 14.929 seconds
Starting 5 threads... Elapsed time: 20.811 seconds
Starting 6 threads... Elapsed time: 22.776 seconds
Starting 7 threads... Elapsed time: 27.160 seconds
Starting 8 threads... Elapsed time: 30.249 seconds

This shows that performance degrades as soon as we have more than 1 thread.

I suspect the reason is that the function makework() is doing memory accesses. In Visual Studio 2010 you can see this by setting a breakpoint on the first line of _tmain(). When you hit the breakpoint, press Ctrl-Alt-D to open the disassembly window. Anywhere you see a register name in brackets (e.g. [esp]), that is a memory access. The bandwidth of the CPU's level 1 memory cache is being saturated. You can test this theory with a modified makework():

void makework(void *jnk) {
    double tmp = 0;
    volatile double *p;
    int i;
    int j;
    p=(double*)jnk;

    for(j=0; j<100000000; j++) {
        for(i=0; i<100; i++) {
            tmp = tmp+(double)i*(double)i;
        }
        *p=tmp;
    }
    *p = tmp;
    _endthread();
}

Starting 1 threads... Elapsed time: 11.684 seconds
Starting 2 threads... Elapsed time: 13.760 seconds
Starting 3 threads... Elapsed time: 14.445 seconds
Starting 4 threads... Elapsed time: 17.519 seconds
Starting 5 threads... Elapsed time: 23.369 seconds
Starting 6 threads... Elapsed time: 25.491 seconds
Starting 7 threads... Elapsed time: 30.155 seconds
Starting 8 threads... Elapsed time: 34.460 seconds

(Possible explanation) Have you checked the background activity on these machines? It may be that the OS cannot fully dedicate all of its cores to you. On your machine 1, the significant increase begins once you occupy more than half of the cores. Your threads may be competing for resources with something else.

You may also want to check whether restrictions or domain policies on your machine/account prevent you from grabbing all available resources.

OK, now that we've ruled out the memory saturation theory (although, x87? Ouch, don't expect much performance there. Try switching to SSE/AVX if you can live with what they provide), the core scaling should still make sense. Let's look at the CPU models you use.

Can you confirm these are the correct models?

Intel® Xeon® Processor X5690 (12M Cache, 3.46 GHz, 6.40 GT/s Intel® QPI)
Intel® Xeon® Processor E5-2690 (20M Cache, 2.90 GHz, 8.00 GT/s Intel® QPI)

If so, then the first one indeed has 6 physical cores (12 logical) per socket and the second has 8 physical cores (16 logical) per socket. Come to think of it, I don't think a higher core count on a single socket was available in these generations, so this makes sense and fits your numbers perfectly.

Edit: On a multi-socket system, the OS might prefer a single socket while logical cores are still available on it. This may depend on the exact version, but for Windows Server 2008 there is an interesting point:

When the OS boots it starts with socket 1 and enumerates all logical processors:

    on socket 1 it enumerates logical processors 1-20
    on socket 2 it enumerates logical processors 21-40
    on socket 3 it enumerates logical processors 41-60
    on socket 4 it would see 61-64
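The enumeration described in that quote is plain integer arithmetic. The sketch below reproduces it under the quote's own numbers (20 logical processors per socket, 64 visible in total); the function names are hypothetical and no Windows API is involved.

```c
#include <assert.h>

/* 1-based socket number that owns a 1-based logical processor index,
 * assuming sockets are enumerated in order with `per_socket` logical
 * processors each. */
int socket_of_logical(int logical, int per_socket) {
    assert(logical >= 1 && per_socket >= 1);
    return (logical - 1) / per_socket + 1;
}

/* How many logical processors of `socket` are visible when the OS stops
 * enumerating at `cap` logical processors overall. */
int visible_on_socket(int socket, int per_socket, int cap) {
    assert(socket >= 1 && per_socket >= 1 && cap >= 0);
    int first = (socket - 1) * per_socket + 1;  /* first index on this socket */
    if (first > cap) return 0;
    int last = socket * per_socket;             /* last index if there were no cap */
    if (last > cap) last = cap;
    return last - first + 1;
}
```

With these numbers, logical processor 21 lands on socket 2 and only 4 processors of socket 4 are seen. The point for the benchmark is that consecutive logical processor numbers fill one socket completely before the next socket is reached.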
void spawnthreads(int num) {
    // One affinity mask per thread: even-numbered bits first, then odd-numbered bits.
    // Only 20 masks are listed, so num must not exceed 20.
    ULONG_PTR masks[] = {  // for my system; YMMV
        0x1, 0x4, 0x10, 0x40, 0x100, 0x400, 0x1000, 0x4000, 0x10000, 0x40000, 
        0x100000, 0x400000, 0x2, 0x8, 0x20, 0x80, 0x200, 0x800, 0x2000, 0x8000};
    HANDLE *hThreads = (HANDLE *)malloc(num*sizeof(HANDLE));
    double *junk = (double *)malloc(num*sizeof(double));
    printf("Starting %i threads... ", num);
    for(int i=0; i<num; i++) {
        hThreads[i] = (HANDLE)_beginthread(makework, 0, &junk[i]);
        SetThreadAffinityMask(hThreads[i], masks[i]);  // pin thread i to one logical processor
    }
    DWORD start = GetTickCount();  // GetTickCount() returns an unsigned DWORD
    WaitForMultipleObjects(num, hThreads, TRUE, INFINITE);
    DWORD end = GetTickCount();
    FILE *fp = fopen("makework.log", "a+");
    fprintf(fp, "%i,%.3f,%f\n", num, (double)(end-start)/1000.0, junk[0]);
    fclose(fp);
    printf("Elapsed time: %.3f seconds\n", (double)(end-start)/1000.0);
    free(hThreads);
    free(junk);  // the result buffer was previously leaked
}
Starting 1 threads... Elapsed time: 12.558 seconds
Starting 2 threads... Elapsed time: 12.558 seconds
Starting 3 threads... Elapsed time: 12.589 seconds
Starting 4 threads... Elapsed time: 12.652 seconds
Starting 5 threads... Elapsed time: 12.621 seconds
Starting 6 threads... Elapsed time: 12.777 seconds
Starting 7 threads... Elapsed time: 12.636 seconds
Starting 8 threads... Elapsed time: 12.886 seconds
Starting 9 threads... Elapsed time: 13.057 seconds
Starting 10 threads... Elapsed time: 12.714 seconds
Starting 11 threads... Elapsed time: 12.777 seconds
Starting 12 threads... Elapsed time: 12.668 seconds
Starting 13 threads... Elapsed time: 26.489 seconds
Starting 14 threads... Elapsed time: 26.505 seconds
Starting 15 threads... Elapsed time: 26.505 seconds
Starting 16 threads... Elapsed time: 26.489 seconds
Starting 17 threads... Elapsed time: 26.489 seconds
Starting 18 threads... Elapsed time: 26.676 seconds
Starting 19 threads... Elapsed time: 26.770 seconds
Starting 20 threads... Elapsed time: 26.489 seconds
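The mask table in spawnthreads() lists the even-numbered bits first and the odd-numbered bits afterwards. One reading, which is an assumption and not stated in the thread, is that logical processors 2k and 2k+1 are hyperthread siblings on one physical core of that machine, so the first 12 threads each get a physical core to themselves. The timings above, flat up to 12 threads and roughly doubled from 13 onwards, are consistent with that reading. The sketch below generates the same ordering; `affinity_mask_for_thread` is a hypothetical helper, not a Windows API.

```c
#include <assert.h>
#include <stdint.h>

/* Affinity mask for the thread_index-th thread on a machine with n_logical
 * logical processors: even-numbered bits are handed out first, then the
 * odd-numbered ones. Assumes n_logical is even and at most 64. */
uint64_t affinity_mask_for_thread(int thread_index, int n_logical) {
    assert(n_logical > 0 && n_logical <= 64 && n_logical % 2 == 0);
    assert(thread_index >= 0 && thread_index < n_logical);
    int half = n_logical / 2;
    int bit = (thread_index < half)
                  ? 2 * thread_index                /* bits 0, 2, 4, ... */
                  : 2 * (thread_index - half) + 1;  /* bits 1, 3, 5, ... */
    return (uint64_t)1 << bit;
}
```

For 24 logical processors this reproduces the table in spawnthreads(): thread 0 gets 0x1, thread 11 gets 0x400000, and thread 12 wraps around to 0x2.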