C: debugging nested for loops with matrix arithmetic and constants


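This set of nested for loops works correctly for M=64 and N=64, but not when I set M=128 and N=64. I have another program that checks the matrix multiplication against the correct values. Intuitively it seems like it should still work, but it gives me the wrong answer.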

for(int m=64;m<=M;m+=64){
for(int n=64;n<=N;n+=64){
    for(int i = m-64; i < m; i+=16){

        float *A_column_start, *C_column_start;
        __m128 c_1, c_2, c_3, c_4, a_1, a_2, a_3, a_4, mul_1, 
               mul_2, mul_3, mul_4, b_1;
        int j, k;

        for(j = m-64; j < m; j++){

            //Load 16 contiguous column aligned elements from matrix C in
            //c_1-c_4 registers

            C_column_start = C+i+j*M;

            c_1 = _mm_loadu_ps(C_column_start);
            c_2 = _mm_loadu_ps(C_column_start+4);
            c_3 = _mm_loadu_ps(C_column_start+8);
            c_4 = _mm_loadu_ps(C_column_start+12);

            for (k=n-64; k < n; k+=2){

                //Load 16 contiguous column aligned elements from matrix A to
                //the a_1-a_4 registers

                A_column_start = A+k*M;

                a_1 = _mm_loadu_ps(A_column_start+i);
                a_2 = _mm_loadu_ps(A_column_start+i+4);
                a_3 = _mm_loadu_ps(A_column_start+i+8);
                a_4 = _mm_loadu_ps(A_column_start+i+12);

                //Load a value to register b_1 to act as a "B" or ("A^T")
                //element to multiply against the A matrix

                b_1 = _mm_load1_ps(A_column_start+j);

                mul_1 = _mm_mul_ps(a_1, b_1);
                mul_2 = _mm_mul_ps(a_2, b_1);
                mul_3 = _mm_mul_ps(a_3, b_1);
                mul_4 = _mm_mul_ps(a_4, b_1);

                //Add together all values of the multiplied A and "B"
                //(or "A^T") matrix elements

                c_4 = _mm_add_ps(c_4, mul_4);
                c_3 = _mm_add_ps(c_3, mul_3);
                c_2 = _mm_add_ps(c_2, mul_2);
                c_1 = _mm_add_ps(c_1, mul_1);

                //Move over one column in A, and load the next 16 contiguous 
                //column aligned elements from matrix A to the a_1-a_4 registers

                A_column_start+=M;

                a_1 = _mm_loadu_ps(A_column_start+i);
                a_2 = _mm_loadu_ps(A_column_start+i+4);
                a_3 = _mm_loadu_ps(A_column_start+i+8);
                a_4 = _mm_loadu_ps(A_column_start+i+12);

                //Load a value to register b_1 to act as a "B" or "A^T"
                //element to multiply against the A matrix

                b_1 = _mm_load1_ps(A_column_start+j);

                mul_1 = _mm_mul_ps(a_1, b_1);
                mul_2 = _mm_mul_ps(a_2, b_1);
                mul_3 = _mm_mul_ps(a_3, b_1);
                mul_4 = _mm_mul_ps(a_4, b_1);

                //Add together all values of the multiplied A and "B" or
                //("A^T") matrix elements

                c_4 = _mm_add_ps(c_4, mul_4);
                c_3 = _mm_add_ps(c_3, mul_3);
                c_2 = _mm_add_ps(c_2, mul_2);
                c_1 = _mm_add_ps(c_1, mul_1);

            }
            //Store the added up C values back to memory

            _mm_storeu_ps(C_column_start, c_1);
            _mm_storeu_ps(C_column_start+4, c_2);
            _mm_storeu_ps(C_column_start+8, c_3);
            _mm_storeu_ps(C_column_start+12, c_4);

        }

    }
    }
}}
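
The checking program mentioned in the question is not shown. As a rough sketch of what such a reference might look like, the scalar routine below computes C += A * A^T with the same column-major indexing the SSE code uses (C+i+j*M and A+k*M+i). The function name reference_aat and the reading of A as an M x N matrix are assumptions inferred from the snippet, not something the question states.

```c
#include <assert.h>

/* Scalar reference (assumed semantics): C += A * A^T.
   A is M x N and C is M x M, both stored column-major with leading
   dimension M, matching the C+i+j*M and A+k*M+i indexing above. */
static void reference_aat(float *C, const float *A, int M, int N)
{
    assert(M > 0 && N > 0);
    for (int j = 0; j < M; j++) {
        for (int k = 0; k < N; k++) {
            /* Element (j, k) of A, used as element (k, j) of A^T. */
            float b = A[j + k * M];
            for (int i = 0; i < M; i++) {
                C[i + j * M] += A[i + k * M] * b;
            }
        }
    }
}
```

Running this and the SSE version on copies of the same inputs, then comparing the outputs entry by entry with a small tolerance, shows which entries of C disagree.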

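I'm guessing that the use of M in the line

C_column_start = C+i+j*M;

needs to be changed to use m instead, and possibly in the other lines that use M as well.
However, I don't really understand your code, since you haven't explained what it is for, and I'm not a maths programmer.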

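It works correctly for M=64 and N=64 because in those cases you perform only one iteration in each of the corresponding loops (the two outermost ones). When M=128, the outer loop now takes two steps, and in that case the line

C_column_start = C+i+j*M;

and the line

A_column_start = A+k*M;

will produce the same results for the inner loops, so for the two steps taken by the outer loop (m = 64, 128) you are essentially just repeating one step's result with M=128. The fix is quite simple: change M to m so that the iteration variable is used.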
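
One way to check this kind of iteration-count reasoning without any SSE is to replay only the loop bounds and record which entries of C they reach. The sketch below does that with a small tile size. TILE and count_visited are hypothetical names, the i loop is flattened to single steps (the original handles 16 contiguous elements per step), and no arithmetic is performed.

```c
#include <assert.h>
#include <stdlib.h>

#define TILE 4  /* hypothetical stand-in for the hard-coded 64 */

/* Replays the loop bounds of the question's nest and returns how many
   distinct entries of the M x M column-major output are visited.
   Returns -1 if the bookkeeping array cannot be allocated. */
static int count_visited(int M, int N)
{
    assert(M > 0 && N > 0 && M % TILE == 0 && N % TILE == 0);
    unsigned char *seen = calloc((size_t)M * (size_t)M, 1);
    if (seen == NULL) {
        return -1;
    }
    for (int m = TILE; m <= M; m += TILE) {
        for (int n = TILE; n <= N; n += TILE) {
            for (int i = m - TILE; i < m; i++) {
                for (int j = m - TILE; j < m; j++) {
                    seen[i + j * M] = 1;  /* same index as C+i+j*M */
                }
            }
        }
    }
    int count = 0;
    for (int idx = 0; idx < M * M; idx++) {
        count += seen[idx];
    }
    free(seen);
    return count;
}
```

With these bounds, count_visited(TILE, TILE) reports every entry of the TILE x TILE output, while count_visited(2*TILE, TILE) reports half of the 2*TILE x 2*TILE output, because i and j both run over the same range [m-TILE, m). Whether that coverage matches what the code is meant to compute is worth checking against the reference results.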


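You should also consider aligning your data in A and C so that you can perform SSE aligned loads. This will result in faster code.

I love it when ppl claim the solution is simple when they can't find the answer themselves? Have you tried stepping through the code in a debugger to see which loops get executed? If M and N are both 64, the outer two loops will execute only once. I would also consider #define-ing the 64 and reducing it to a smaller value to generate a minimal test case that helps you understand what is going on.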
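
As an illustration of the alignment suggestion, here is a minimal sketch assuming the buffers are allocated with _mm_malloc on a 16-byte boundary, which allows _mm_load_ps and _mm_store_ps in place of the unaligned variants. The helper names alloc_floats_aligned and axpy_aligned are made up for this example and do not come from the question's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <xmmintrin.h>  /* SSE: __m128, _mm_load_ps, _mm_store_ps, _mm_malloc */

/* Hypothetical helper: allocates `count` floats on a 16-byte boundary.
   The result must be released with _mm_free, not free. */
static float *alloc_floats_aligned(int count)
{
    return (float *)_mm_malloc((size_t)count * sizeof(float), 16);
}

/* Hypothetical helper: y[0..n) += s * x[0..n) using aligned accesses.
   Assumes x and y are 16-byte aligned and n is a multiple of 4. */
static void axpy_aligned(float *y, const float *x, float s, int n)
{
    assert((uintptr_t)(const void *)x % 16 == 0);
    assert((uintptr_t)(const void *)y % 16 == 0);
    assert(n % 4 == 0);
    __m128 vs = _mm_set1_ps(s);  /* broadcast s, like _mm_load1_ps for b_1 */
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_load_ps(x + i);  /* aligned load */
        __m128 vy = _mm_load_ps(y + i);
        vy = _mm_add_ps(vy, _mm_mul_ps(vx, vs));
        _mm_store_ps(y + i, vy);         /* aligned store */
    }
}
```

In the question's layout every vector access starts at an offset that is a multiple of 4 floats (i steps by 16 and M is a multiple of 64), so a 16-byte-aligned base pointer for A and C would keep each vector load and store aligned.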