C++: Running the same calculation twice with Armadillo vectors gives different results

I have the following test function:

#include <armadillo>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <utility>

// Times the same matrix-vector product in two separate loops and compares the results.
// Returns the per-round time of each loop in whole microseconds,
// or (-1, -1) if the two result vectors are not approximately equal.
std::pair<double, double> test_speed_vmv(const size_t size)
{
    const size_t rounds = 10000;
    arma::colvec a = arma::colvec(size, arma::fill::randn);
    arma::rowvec b = arma::rowvec(size, arma::fill::randn);
    arma::colvec b1 = arma::colvec(size);
    arma::colvec c = arma::colvec(size, arma::fill::zeros);
    arma::colvec d = arma::colvec(size, arma::fill::zeros);
    arma::colvec e = arma::colvec(size, arma::fill::zeros);
    arma::mat A = arma::mat(size, size, arma::fill::ones);
    // Copy the row vector into a column vector (b1 is not used in the timed loops).
    for(size_t i = 0; i < size; ++i)
        b1[i] = b[i];
    // First timed loop: result stored in c.
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();
    for(size_t j = 0; j < rounds; ++j)
        c = A * a;
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now();
    // Second timed loop: the same product, result stored in e.
    std::chrono::high_resolution_clock::time_point t3 = std::chrono::high_resolution_clock::now();
    for(size_t j = 0; j < rounds; ++j)
    {
        e = A * a;
    }
    std::chrono::high_resolution_clock::time_point t4 = std::chrono::high_resolution_clock::now();
    // Print both results side by side for inspection.
    for(size_t i = 0; i < size; ++i)
        std::cout << c[i] << '\t' << e[i] << '\n';
    // Integer division: per-round times below one microsecond come out as 0.
    auto duration_avx = std::chrono::duration_cast<std::chrono::microseconds>(t2 - t1).count()/rounds;
    auto duration_arma = std::chrono::duration_cast<std::chrono::microseconds>(t4 - t3).count()/rounds;
    if(arma::approx_equal(c, e, "absdiff", 1e-3))
        return {static_cast<double>(duration_avx), static_cast<double>(duration_arma)};
    else
        return {-1.0, -1.0};
}

The timings are consistent over several runs. But why do I get different results for vectors longer than about 96 elements?
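To narrow down a failing comparison like this, it can help to report how large the largest difference is and where it occurs, instead of only a pass/fail flag. The following is a minimal sketch using plain `std::vector<double>` rather than Armadillo types. `DiffReport`, `max_abs_diff` and `within_absdiff` are hypothetical helper names introduced here for illustration; they are not part of Armadillo and are not claimed to reproduce `arma::approx_equal` exactly.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of Armadillo): summary of an element-wise comparison.
struct DiffReport {
    double max_abs_diff;  // largest |x[i] - y[i]| among comparable pairs
    std::size_t index;    // position of that largest difference
    bool has_nan;         // true if some difference is NaN (NaN input or inf - inf)
};

// Scans two equally sized vectors and records where they disagree the most.
DiffReport max_abs_diff(const std::vector<double>& x, const std::vector<double>& y)
{
    assert(x.size() == y.size());  // precondition of this sketch
    DiffReport r{0.0, 0, false};
    for (std::size_t i = 0; i < x.size(); ++i) {
        const double d = std::fabs(x[i] - y[i]);
        if (std::isnan(d)) {
            r.has_nan = true;  // flag it instead of silently skipping
            continue;
        }
        if (d > r.max_abs_diff) {
            r.max_abs_diff = d;
            r.index = i;
        }
    }
    return r;
}

// Absolute-difference test in the spirit of the check in the question:
// passes only if sizes match, no NaN shows up, and every difference is <= tol.
bool within_absdiff(const std::vector<double>& x, const std::vector<double>& y, double tol)
{
    if (x.size() != y.size()) return false;
    const DiffReport r = max_abs_diff(x, y);
    return !r.has_nan && r.max_abs_diff <= tol;
}
```

Printing `max_abs_diff` and `index` for the two result vectors would show whether they differ by rounding-level amounts or by something much larger, which points to quite different causes.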

It was compiled with:

g++ -I -O2 -ftree-vectorize -mavx2 -funroll-loops -g -march=native -std=gnu++17 -fopenmp -c avx2_test.cpp -o avx2_test.o
g++ -lm -larmadillo -lgomp -lpthread -lX11 -L/opt/boost/lib -lboost_system -L/opt/intel/mkl/lib/intel64 -lmkl_rt avx2_test.o -o avx2

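Are you asking why the timings differ?

Both. I would expect the two timings to be roughly equal, not 80 times slower for a vector length of 80. Looking at the code, the first computation is identical to the second, so the compiler may (possibly) choose to compute it once and copy the result. Either way, the result should be the same for all operations and should not differ for lengths > 96 elements.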

Multiplication of a matrix with a vector with a length of 2 took 0 for a single line and 0 for two lines, resulting in a single line being 0 times faster
Multiplication of a matrix with a vector with a length of 5 took 39 for a single line and 0 for two lines, resulting in a single line being 0 times faster
Multiplication of a matrix with a vector with a length of 10 took 0 for a single line and 0 for two lines, resulting in a single line being 0 times faster
Multiplication of a matrix with a vector with a length of 80 took 2250 for a single line and 1 for two lines, resulting in a single line being 0.000444444 times faster
Multiplication of a matrix with a vector with a length of 95 took 1 for a single line and 1 for two lines, resulting in a single line being 1 times faster
Multiplication of a matrix with a vector with a length of 96 took -1 for a single line and -1 for two lines, resulting in a single line being 1 times faster
Multiplication of a matrix with a vector with a length of 99 took -1 for a single line and -1 for two lines, resulting in a single line being 1 times faster
Multiplication of a matrix with a vector with a length of 100 took -1 for a single line and -1 for two lines, resulting in a single line being 1 times faster
Multiplication of a matrix with a vector with a length of 128 took -1 for a single line and -1 for two lines, resulting in a single line being 1 times faster
Multiplication of a matrix with a vector with a length of 256 took -1 for a single line and -1 for two lines, resulting in a single line being 1 times faster
Multiplication of a matrix with a vector with a length of 512 took -1 for a single line and -1 for two lines, resulting in a single line being 1 times faster
Multiplication of a matrix with a vector with a length of 1000 took -1 for a single line and -1 for two lines, resulting in a single line being 1 times faster
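The `0` entries above are consistent with the integer arithmetic in the test function: the elapsed time is converted to whole microseconds and then divided by the integer round count, so any per-round time below one microsecond truncates to 0. The sketch below contrasts that with a floating-point duration. The function names are hypothetical, and `std::chrono::steady_clock` is used here as an assumption (the question uses `high_resolution_clock`).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Integer version, mirroring the arithmetic in the question for non-negative
// totals: whole microseconds divided by the round count, truncated toward zero.
long long per_round_us_truncated(std::chrono::steady_clock::duration total, std::size_t rounds)
{
    assert(rounds > 0);
    const auto us = std::chrono::duration_cast<std::chrono::microseconds>(total).count();
    return us / static_cast<long long>(rounds);
}

// Floating-point version: keeps fractions of a microsecond.
double per_round_us(std::chrono::steady_clock::duration total, std::size_t rounds)
{
    assert(rounds > 0);
    // Converting to a floating-point duration is implicit and loses no whole ticks.
    const std::chrono::duration<double, std::micro> us = total;
    return us.count() / static_cast<double>(rounds);
}
```

For example, a total of 5000 microseconds over 10000 rounds gives 0 with the truncating version and 0.5 with the floating-point one, so ratios computed from the truncated values can be misleading for small vector sizes.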
