How do I write a matrix product that can compete with Eigen?

Below is a C++ program that compares the time taken by Eigen and by plain for loops to perform a matrix-matrix product. The for loops are arranged to minimize cache misses. They are initially faster than Eigen but eventually become slower (up to 2x for 500x500 matrices). What else should I do to compete with Eigen? Is blocking the reason Eigen performs better? If so, how should I add blocking to the for loops?

#include<iostream>
#include<Eigen/Dense>
#include<ctime>

int main(int argc, char* argv[]) {
    srand(time(NULL));

    // Input the size of the matrix from the user
    int N = atoi(argv[1]);
    int M = N*N;

    // The matrices stored as row-wise vectors
    double a[M];
    double b[M];
    double c[M];

    // Initializing Eigen matrices
    Eigen::MatrixXd a_E = Eigen::MatrixXd::Random(N,N);
    Eigen::MatrixXd b_E = Eigen::MatrixXd::Random(N,N);
    Eigen::MatrixXd c_E(N,N);

    double CPS = CLOCKS_PER_SEC;
    clock_t start, end;

    // Matrix-matrix product by Eigen
    start = clock();
    c_E = a_E*b_E;
    end = clock();
    std::cout << "\nTime taken by Eigen is: " << (end-start)/CPS << "\n";

    // Initializing the for-loop arrays
    int count = 0;
    for (int j=0; j<N; ++j) {
        for (int k=0; k<N; ++k) {
            a[count] = a_E(j,k);
            b[count] = b_E(j,k);
            ++count;
        }
    }

    // Matrix-matrix product by for loops
    start = clock();
    count = 0;
    int count1, count2;
    for (int j=0; j<N; ++j) {
        count1 = j*N;
        for (int k=0; k<N; ++k) {
            c[count] = a[count1]*b[k];
            ++count;
        }
    }
    for (int j=0; j<N; ++j) {
        count2 = N;
        for (int l=1; l<N; ++l) {
            count = j*N;
            count1 = count+l;
            for (int k=0; k<N; ++k) {
                c[count] += a[count1]*b[count2];
                ++count;
                ++count2;
            }
        }
    }
    end = clock();
    std::cout << "\nTime taken by for-loop is: " << (end-start)/CPS << "\n";
}

I can suggest two simple optimizations:

1) Vectorize it. It would be nicer to vectorize it using inline assembly or by writing an assembly routine, but you can also use compiler intrinsics. You can even let the compiler vectorize the loop for you, although it is sometimes hard to write a loop in a form the compiler will actually vectorize.


2) Make it parallel. Try OpenMP. (A short sketch combining both suggestions follows below.)
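
As a minimal sketch of what both suggestions can look like on the row-major loops from the question (illustrative only, not taken from Eigen and not tuned): the i-k-j loop order gives the inner loop unit stride so the compiler can auto-vectorize it, and OpenMP parallelizes the outer loop over rows of C.

// Minimal sketch: auto-vectorizable inner loop + OpenMP-parallel outer loop.
// Build with e.g.  g++ -O3 -march=native -fopenmp  (the pragma is simply
// ignored without -fopenmp). Row-major N x N matrices, as in the question.
void matmul_simple(double* c, const double* a, const double* b, int N)
{
    #pragma omp parallel for
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j)
            c[i*N + j] = 0.0;                  // clear row i of C
        for (int k = 0; k < N; ++k) {
            const double aik = a[i*N + k];     // reused across the whole row
            for (int j = 0; j < N; ++j)        // contiguous in c and b: vectorizable
                c[i*N + j] += aik * b[k*N + j];
        }
    }
}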

The compiler already vectorizes your code quite well. The key to higher performance is hierarchical blocking, to optimize the use of the registers and of the different levels of cache. Partial loop unrolling is also crucial to improve instruction pipelining. Reaching the performance of Eigen's product requires a lot of effort and tuning.
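
To make the blocking idea concrete, here is a minimal illustrative sketch (my own, not code from this answer; BLOCK is an arbitrary tile size you would tune to your L1/L2 cache) of tiling the same row-major loops so that each BLOCK x BLOCK piece of A, B and C is reused while it is still cache-resident:

#include <algorithm>   // std::min

// Minimal cache-blocking sketch: the three outer loops walk over tiles, the
// three inner loops do a small dense product whose working set fits in cache.
// BLOCK is an assumed, tunable tile size, not a value taken from the answer.
constexpr int BLOCK = 64;

void matmul_blocked(double* c, const double* a, const double* b, int N)
{
    for (int i = 0; i < N*N; ++i) c[i] = 0.0;          // C starts at zero

    for (int ii = 0; ii < N; ii += BLOCK)
        for (int kk = 0; kk < N; kk += BLOCK)
            for (int jj = 0; jj < N; jj += BLOCK)
                // multiply the (ii,kk) tile of A by the (kk,jj) tile of B
                for (int i = ii; i < std::min(ii + BLOCK, N); ++i)
                    for (int k = kk; k < std::min(kk + BLOCK, N); ++k) {
                        const double aik = a[i*N + k];
                        for (int j = jj; j < std::min(jj + BLOCK, N); ++j)
                            c[i*N + j] += aik * b[k*N + j];
                    }
}

Hierarchical blocking, as in Eigen or the BLIS-style code further down, adds several such levels (registers, L1, L2, L3) plus packing of the tiles into contiguous buffers.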

It should also be noted that your benchmark is a bit biased and not fully reliable. Here is a more reliable version (you need the full Eigen sources to get bench/BenchTimer.h):

#include<iostream>
#include<Eigen/Dense>
#include<bench/BenchTimer.h>

void myprod(double *c, const double* a, const double* b, int N) {
    int count = 0;
    int count1, count2;

    for (int j=0; j<N; ++j) {
    // ...
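
The rest of that snippet is cut off above. The important change, judging from the main() in the full listing further down, is that it times both products with Eigen's BenchTimer and its BENCH macro (several trials and repetitions, keeping the best time) and uses noalias() on the Eigen side, rather than a single pair of clock() calls. A minimal sketch of such a harness, assuming the same tries/rep heuristic as that main():

#include <iostream>
#include <cstdlib>
#include <algorithm>
#include <Eigen/Dense>
#include <bench/BenchTimer.h>   // requires the full Eigen source tree

void myprod(double *c, const double* a, const double* b, int N);  // product under test

int main(int argc, char* argv[]) {
    int N     = atoi(argv[1]);
    int tries = 4;                                  // measurement trials
    int rep   = std::max<int>(1, 10000000/N/N/N);   // repetitions per trial

    Eigen::MatrixXd a_E = Eigen::MatrixXd::Random(N,N);
    Eigen::MatrixXd b_E = Eigen::MatrixXd::Random(N,N);
    Eigen::MatrixXd c_E(N,N);

    Eigen::BenchTimer t1, t2;
    BENCH(t1, tries, rep, c_E.noalias() = a_E*b_E);                        // Eigen
    BENCH(t2, tries, rep, myprod(c_E.data(), a_E.data(), b_E.data(), N));  // hand-written

    std::cout << "Time taken by Eigen is: "    << t1.best() << "\n";
    std::cout << "Time taken by for-loop is: " << t2.best() << "\n";
}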

There is no need to mystify how to achieve a high-performance implementation of the matrix-matrix product. In fact, we need more people to understand it in order to face the challenges of future high-performance computing. Reading the BLIS papers is a good starting point for getting into the topic.

So, to demystify things and to answer the question (how do I write a matrix product that can compete with Eigen?), I extended the code ggael posted to a total of about 400 lines. I just tested it on an AVX machine (Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz). Here are some results:

g++-5.3 -O3 -DNDEBUG -std=c++11 -mavx -m64 -I ../eigen.3.2.8/ gemm.cc -lrt

lehn@heim:~/work/test_eigen$ ./a.out 500
Time taken by Eigen is: 0.0190425
Time taken by for-loop is: 0.0121688

lehn@heim:~/work/test_eigen$ ./a.out 1000
Time taken by Eigen is: 0.147991
Time taken by for-loop is: 0.0959097

lehn@heim:~/work/test_eigen$ ./a.out 1500
Time taken by Eigen is: 0.492858
Time taken by for-loop is: 0.322442

lehn@heim:~/work/test_eigen$ ./a.out 5000
Time taken by Eigen is: 18.3666
Time taken by for-loop is: 12.1023
If you have FMA, you can compile with:

g++-5.3 -O3 -DNDEBUG -std=c++11 -mfma -m64 -I ../eigen.3.2.8/ -DHAVE_FMA gemm.cc -lrt
If you also want multithreading through OpenMP, add -fopenmp.
Below is the complete code, based on the ideas from the BLIS papers. It is self-contained, except that (as ggael already pointed out) it needs the full Eigen sources for bench/BenchTimer.h:

#include<iostream>
#include<cstdlib>       // atoi, std::malloc, std::free
#include<cstdint>       // uintptr_t
#include<algorithm>     // std::max
#include<type_traits>   // std::enable_if, std::common_type
#include<Eigen/Dense>
#include<bench/BenchTimer.h>
#if defined(_OPENMP)
#include <omp.h>
#endif
//-- malloc with alignment --------------------------------------------------------
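// malloc_ over-allocates and stashes the original pointer just below the
// aligned address it returns, so that free_ can recover and release it.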
void *
malloc_(std::size_t alignment, std::size_t size)
{
    alignment = std::max(alignment, alignof(void *));
    size     += alignment;

    void *ptr  = std::malloc(size);
    void *ptr2 = (void *)(((uintptr_t)ptr + alignment) & ~(alignment-1));
    void **vp  = (void**) ptr2 - 1;
    *vp        = ptr;
    return ptr2;
}

void
free_(void *ptr)
{
    std::free(*((void**)ptr-1));
}

//-- Config --------------------------------------------------------------------
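// MC/KC/NC are the cache-level blocking sizes and MR/NR the register-level
// micro-kernel tile; the BS_D_* values below are the defaults for double
// (the HAVE_FMA variant uses a wider NR and a deeper KC).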

// SIMD-Register width in bits
// SSE:         128
// AVX/FMA:     256
// AVX-512:     512
#ifndef SIMD_REGISTER_WIDTH
#define SIMD_REGISTER_WIDTH 256
#endif

#ifdef HAVE_FMA

#   ifndef BS_D_MR
#   define BS_D_MR 4
#   endif

#   ifndef BS_D_NR
#   define BS_D_NR 12
#   endif

#   ifndef BS_D_MC
#   define BS_D_MC 256
#   endif

#   ifndef BS_D_KC
#   define BS_D_KC 512
#   endif

#   ifndef BS_D_NC
#   define BS_D_NC 4092
#   endif

#endif



#ifndef BS_D_MR
#define BS_D_MR 4
#endif

#ifndef BS_D_NR
#define BS_D_NR 8
#endif

#ifndef BS_D_MC
#define BS_D_MC 256
#endif

#ifndef BS_D_KC
#define BS_D_KC 256
#endif

#ifndef BS_D_NC
#define BS_D_NC 4096
#endif

template <typename T>
struct BlockSize
{
    static constexpr int MC = 64;
    static constexpr int KC = 64;
    static constexpr int NC = 256;
    static constexpr int MR = 8;
    static constexpr int NR = 8;

    static constexpr int rwidth = 0;
    static constexpr int align  = alignof(T);
    static constexpr int vlen   = 0;

    static_assert(MC>0 && KC>0 && NC>0 && MR>0 && NR>0, "Invalid block size.");
    static_assert(MC % MR == 0, "MC must be a multiple of MR.");
    static_assert(NC % NR == 0, "NC must be a multiple of NR.");
};


template <>
struct BlockSize<double>
{
    static constexpr int MC     = BS_D_MC;
    static constexpr int KC     = BS_D_KC;
    static constexpr int NC     = BS_D_NC;
    static constexpr int MR     = BS_D_MR;
    static constexpr int NR     = BS_D_NR;

    static constexpr int rwidth = SIMD_REGISTER_WIDTH;
    static constexpr int align  = rwidth / 8;
    static constexpr int vlen   = rwidth / (8*sizeof(double));

    static_assert(MC>0 && KC>0 && NC>0 && MR>0 && NR>0, "Invalid block size.");
    static_assert(MC % MR == 0, "MC must be a multiple of MR.");
    static_assert(NC % NR == 0, "NC must be a multiple of NR.");
    static_assert(rwidth % sizeof(double) == 0, "SIMD register width not sane.");
};

//-- aux routines --------------------------------------------------------------
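// geaxpy computes Y += alpha*X and gescal computes X *= alpha (or X = 0) on
// arbitrarily strided matrices; they scale C and write back buffered edge tiles.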
template <typename Index, typename Alpha, typename TX, typename TY>
void
geaxpy(Index m, Index n,
       const Alpha &alpha,
       const TX *X, Index incRowX, Index incColX,
       TY       *Y, Index incRowY, Index incColY)
{
    for (Index j=0; j<n; ++j) {
        for (Index i=0; i<m; ++i) {
            Y[i*incRowY+j*incColY] += alpha*X[i*incRowX+j*incColX];
        }
    }
}

template <typename Index, typename Alpha, typename TX>
void
gescal(Index m, Index n,
       const Alpha &alpha,
       TX *X, Index incRowX, Index incColX)
{
    if (alpha!=Alpha(0)) {
        for (Index j=0; j<n; ++j) {
            for (Index i=0; i<m; ++i) {
                X[i*incRowX+j*incColX] *= alpha;
            }
        }
    } else {
        for (Index j=0; j<n; ++j) {
            for (Index i=0; i<m; ++i) {
                X[i*incRowX+j*incColX] = Alpha(0);
            }
        }
    }
}


//-- Micro Kernel --------------------------------------------------------------
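// Micro kernel: updates a single MR x NR tile of C with beta*C + alpha*A*B,
// reading kc-long packed panels of A and B and keeping the accumulators in
// SIMD registers (GCC vector-extension type vx).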
template <typename Index, typename T>
typename std::enable_if<BlockSize<T>::vlen != 0,
         void>::type
ugemm(Index kc, T alpha, const T *A, const T *B, T beta,
      T *C, Index incRowC, Index incColC)
{
    typedef T vx __attribute__((vector_size (BlockSize<T>::rwidth/8)));

    static constexpr Index vlen = BlockSize<T>::vlen;
    static constexpr Index MR   = BlockSize<T>::MR;
    static constexpr Index NR   = BlockSize<T>::NR/vlen;

    A = (const T*) __builtin_assume_aligned (A, BlockSize<T>::align);
    B = (const T*) __builtin_assume_aligned (B, BlockSize<T>::align);

    vx P[MR*NR] = {};

    for (Index l=0; l<kc; ++l) {
        const vx *b = (const vx *)B;
        for (Index i=0; i<MR; ++i) {
            for (Index j=0; j<NR; ++j) {
                P[i*NR+j] += A[i]*b[j];
            }
        }
        A += MR;
        B += vlen*NR;
    }

    if (alpha!=T(1)) {
        for (Index i=0; i<MR; ++i) {
            for (Index j=0; j<NR; ++j) {
                P[i*NR+j] *= alpha;
            }
        }
    }

    if (beta!=T(0)) {
        for (Index i=0; i<MR; ++i) {
            for (Index j=0; j<NR; ++j) {
                const T *p = (const T *) &P[i*NR+j];
                for (Index j1=0; j1<vlen; ++j1) {
                    C[i*incRowC+(j*vlen+j1)*incColC] *= beta;
                    C[i*incRowC+(j*vlen+j1)*incColC] += p[j1];
                }
            }
        }
    } else {
        for (Index i=0; i<MR; ++i) {
            for (Index j=0; j<NR; ++j) {
                const T *p = (const T *) &P[i*NR+j];
                for (Index j1=0; j1<vlen; ++j1) {
                    C[i*incRowC+(j*vlen+j1)*incColC] = p[j1];
                }
            }
        }
    }
}

//-- Macro Kernel --------------------------------------------------------------
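// Macro kernel: multiplies a packed mc x kc block of A by a packed kc x nc
// block of B, dispatching full MR x NR tiles directly to the micro kernel and
// routing partial edge tiles through a small buffer plus gescal/geaxpy.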
template <typename Index, typename T, typename Beta, typename TC>
void
mgemm(Index mc, Index nc, Index kc,
      T alpha,
      const T *A, const T *B,
      Beta beta,
      TC *C, Index incRowC, Index incColC)
{
    const Index MR = BlockSize<T>::MR;
    const Index NR = BlockSize<T>::NR;
    const Index mp  = (mc+MR-1) / MR;
    const Index np  = (nc+NR-1) / NR;
    const Index mr_ = mc % MR;
    const Index nr_ = nc % NR;


    #pragma omp parallel for
    for (Index j=0; j<np; ++j) {
        T C_[MR*NR];        // thread-local buffer for partial (edge) tiles
        const Index nr = (j!=np-1 || nr_==0) ? NR : nr_;

        for (Index i=0; i<mp; ++i) {
            const Index mr = (i!=mp-1 || mr_==0) ? MR : mr_;

            if (mr==MR && nr==NR) {
                ugemm(kc, alpha,
                      &A[i*kc*MR], &B[j*kc*NR],
                      beta,
                      &C[i*MR*incRowC+j*NR*incColC],
                      incRowC, incColC);
            } else {
                ugemm(kc, alpha,
                      &A[i*kc*MR], &B[j*kc*NR],
                      T(0),
                      C_, Index(1), MR);
                gescal(mr, nr, beta,
                       &C[i*MR*incRowC+j*NR*incColC],
                       incRowC, incColC);
                geaxpy(mr, nr, T(1), C_, Index(1), MR,
                       &C[i*MR*incRowC+j*NR*incColC],
                       incRowC, incColC);
            }
        }
    }
}
//-- Packing blocks ------------------------------------------------------------
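// pack_A copies an mc x kc block of A into contiguous panels of MR rows and
// pack_B a kc x nc block of B into panels of NR columns, zero-padding the
// edges, so the micro kernel can stream through them with unit stride.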
template <typename Index, typename TA, typename T>
void
pack_A(Index mc, Index kc,
       const TA *A, Index incRowA, Index incColA,
       T *p)
{
    Index MR = BlockSize<T>::MR;
    Index mp = (mc+MR-1) / MR;

    for (Index j=0; j<kc; ++j) {
        for (Index l=0; l<mp; ++l) {
            for (Index i0=0; i0<MR; ++i0) {
                Index i  = l*MR + i0;
                Index nu = l*MR*kc + j*MR + i0;
                p[nu]   = (i<mc) ? A[i*incRowA+j*incColA]
                                 : T(0);
            }
        }
    }
}

template <typename Index, typename TB, typename T>
void
pack_B(Index kc, Index nc,
       const TB *B, Index incRowB, Index incColB,
       T *p)
{
    Index NR = BlockSize<T>::NR;
    Index np = (nc+NR-1) / NR;

    for (Index l=0; l<np; ++l) {
        for (Index j0=0; j0<NR; ++j0) {
            for (Index i=0; i<kc; ++i) {
                Index j  = l*NR+j0;
                Index nu = l*NR*kc + i*NR + j0;
                p[nu]   = (j<nc) ? B[i*incRowB+j*incColB]
                                 : T(0);
            }
        }
    }
}
//-- Frame routine -------------------------------------------------------------
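// Frame routine: blocks the m x n x k product into NC/KC/MC-sized panels that
// fit the cache hierarchy, packs each panel, and calls the macro kernel.
// Row and column strides are explicit, so any storage order is supported.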
template <typename Index, typename Alpha,
         typename TA, typename TB,
         typename Beta,
         typename TC>
void
gemm(Index m, Index n, Index k,
     Alpha alpha,
     const TA *A, Index incRowA, Index incColA,
     const TB *B, Index incRowB, Index incColB,
     Beta beta,
     TC *C, Index incRowC, Index incColC)
{
    typedef typename std::common_type<Alpha, TA, TB>::type  T;

    const Index MC = BlockSize<T>::MC;
    const Index NC = BlockSize<T>::NC;
    const Index MR = BlockSize<T>::MR;
    const Index NR = BlockSize<T>::NR;

    const Index KC = BlockSize<T>::KC;
    const Index mb = (m+MC-1) / MC;
    const Index nb = (n+NC-1) / NC;
    const Index kb = (k+KC-1) / KC;
    const Index mc_ = m % MC;
    const Index nc_ = n % NC;
    const Index kc_ = k % KC;

    if (alpha==Alpha(0) || k==0) {
        gescal(m, n, beta, C, incRowC, incColC);
        return;
    }

    // packing buffers for one MC x KC block of A and one KC x NC block of B
    T *A_ = (T*) malloc_(BlockSize<T>::align, sizeof(T)*(MC*KC+MR));
    T *B_ = (T*) malloc_(BlockSize<T>::align, sizeof(T)*(KC*NC+NR));

    for (Index j=0; j<nb; ++j) {
        Index nc = (j!=nb-1 || nc_==0) ? NC : nc_;

        for (Index l=0; l<kb; ++l) {
            Index   kc  = (l!=kb-1 || kc_==0) ? KC : kc_;
            Beta beta_  = (l==0) ? beta : Beta(1);

            pack_B(kc, nc,
                   &B[l*KC*incRowB+j*NC*incColB],
                   incRowB, incColB,
                   B_);

            for (Index i=0; i<mb; ++i) {
                Index mc = (i!=mb-1 || mc_==0) ? MC : mc_;

                pack_A(mc, kc,
                       &A[i*MC*incRowA+l*KC*incColA],
                       incRowA, incColA,
                       A_);

                mgemm(mc, nc, kc,
                      T(alpha), A_, B_, beta_,
                      &C[i*MC*incRowC+j*NC*incColC],
                      incRowC, incColC);
            }
        }
    }
    free_(A_);
    free_(B_);
}

//------------------------------------------------------------------------------

void myprod(double *c, const double* a, const double* b, int N) {
    gemm(N, N, N, 1.0, a, 1, N, b, 1, N, 0.0, c, 1, N);
}

int main(int argc, char* argv[]) {
  int N = atoi(argv[1]);
  int tries = 4;
  int rep = std::max<int>(1,10000000/N/N/N);

  Eigen::MatrixXd a_E = Eigen::MatrixXd::Random(N,N);
  Eigen::MatrixXd b_E = Eigen::MatrixXd::Random(N,N);
  Eigen::MatrixXd c_E(N,N);

  Eigen::BenchTimer t1, t2;

  BENCH(t1, tries, rep, c_E.noalias() = a_E*b_E );
  BENCH(t2, tries, rep, myprod(c_E.data(), a_E.data(), b_E.data(), N));

  std::cout << "Time taken by Eigen is: " << t1.best() << "\n";
  std::cout << "Time taken by for-loop is: " << t2.best() << "\n\n";
}