Performance of vector/matrix/linear algebra libraries (C++)

Tags: c++, performance, optimization, linear-algebra, benchmarking

I am writing a machine learning library from scratch (on the CPU). It involves heavy operations on float vectors. My goal is to make the library efficient and fast. To that end I identified a list of core vector operations on which the performance of my machine learning library largely depends: element-wise addition, element-wise multiplication, dot product, and element-wise sqrt/pow/exp.

I know that fast floating-point vector operations call for SIMD, instruction pipelining, cache-line-friendly access, and relaxing the IEEE-754 floating-point checks (meaning dropping denormal handling, NaN/+0/-0 checks, and so on).
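
For reference, a minimal sketch of disabling denormal handling on x86 through the SSE control register (this snippet is an illustration, not part of the benchmark below; it assumes GCC/Clang with the SSE intrinsics headers, and the remaining IEEE checks are usually relaxed with compiler flags such as -ffast-math):

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (FTZ bit) */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (DAZ bit) */

int main() {
    /* FTZ: denormal results of computations are replaced with 0 */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    /* DAZ: denormal inputs are treated as 0 (needs a CPU with the DAZ bit) */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    /* ... run the float-heavy kernels here ... */
    return 0;
}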

There are many C++ libraries that provide the core vector operations I need, and they often rely on BLAS (or OpenBLAS) for the SIMD work. I have used Armadillo for quite some time without much benchmarking or comparison against other libraries. But after profiling my program I found that it spends only a negligible amount of time inside the OpenBLAS library. So I decided to do a real performance benchmark of the different library implementations and compare them against a naive implementation based on std::vector:


Vector size - 10000 floats, number of operations - 1000000

Element-wise addition

std::vector - 2s:477ms:183μs:102ns
armadillo - 2s:452ms:828μs:409ns
blaze - 2s:404ms:917μs:747ns
VCL(SSE2) - 1s:957ms:42μs:608ns
Element-wise multiplication

std::vector - 2s:438ms:7μs:792ns
armadillo - 2s:433ms:195μs:322ns
blaze - 2s:415ms:86μs:600ns
VCL(SSE2) - 1s:927ms:301μs:214ns
Scalar / dot / inner product

std::vector - 2s:681ms:335μs:990ns
armadillo - 2s:600ms:441μs:415ns
blaze - 2s:289ms:95μs:894ns
VCL(SSE2) - 3s:485ms:100μs:644ns
List of compared libraries

Armadillo (+OpenBLAS):

Blaze (+OpenBLAS):

VCL:


Surprisingly, the naive implementation is as fast as Armadillo and Blaze (the differences are negligible). VCL is about 23% faster than the other implementations for element-wise addition and multiplication. This is mostly due to GCC's PGO (profile-guided optimization); without it, its performance is the same as the rest.

So, my questions are:

  • The compiler already seems to optimize the naive implementation with SIMD and instruction pipelining. Can linear algebra libraries really do better on these core operations? (I can update the post with benchmarks of other libraries if needed; see the auto-vectorization sketch after this list.)

  • Is my benchmarking methodology suitable for the task? (code below)
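
On the first question, a minimal sketch (illustrative names, not from the benchmark below) of the kind of naive loop GCC auto-vectorizes at -O3; the __restrict__ qualifiers tell the compiler the arrays do not alias, so it can emit packed SIMD adds without runtime overlap checks:

#include <cstddef>

/* With e.g. g++ -O3 -march=native this compiles to packed SIMD
 * additions; without __restrict__ the compiler must either prove
 * dst and src do not overlap or insert a runtime check. */
void add_inplace(float* __restrict__ dst, const float* __restrict__ src, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j)
        dst[j] += src[j];
}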

Benchmark code - vecbench.cpp

#include <armadillo>
#include <blaze/Math.h>
#include <vcl/vectorclass.h>
#include <chrono>
#include <string>
#include <sstream>
#include <iostream>
#include <vector>

// -----------------------BENCHMARK-FUNCTIONS---------------------------
typedef std::chrono::duration<int, std::ratio<86400>> days;

#define BENCHMARK_START \
{   \
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();

#define BENCHMARK_END(name, itCount) \
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now(); \
    std::cout << "Benchmark for " << name << ":\n  " << hreadableUnit(std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1)) \
                        << " total\n  " << hreadableUnit(std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1) / itCount) \
                        << " average per iteration\n\n"; \
}

/* Pretty-print a duration as [d:][h:][m:][s:][ms:][μs:]ns,
 * skipping units until the first non-zero one */
std::string hreadableUnit(std::chrono::nanoseconds ns) {
  auto d = std::chrono::duration_cast<days>(ns);
  ns -= d;
  auto h = std::chrono::duration_cast<std::chrono::hours>(ns);
  ns -= h;
  auto m = std::chrono::duration_cast<std::chrono::minutes>(ns);
  ns -= m;
  auto s = std::chrono::duration_cast<std::chrono::seconds>(ns);
  ns -= s;
  auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(ns);
  ns -= millis;
  auto micros = std::chrono::duration_cast<std::chrono::microseconds>(ns);
  ns -= micros;

  std::stringstream ss;
  bool evenZero = false;
  if (d.count() || evenZero) {
    ss << d.count() << "d:";
    evenZero = true;
  }
  if (h.count() || evenZero) {
    ss << h.count() << "h:";
    evenZero = true;
  }
  if (m.count() || evenZero) {
    ss << m.count() << "m:";
    evenZero = true;
  }
  if (s.count() || evenZero) {
    ss << s.count() << "s:";
    evenZero = true;
  }
  if (millis.count() || evenZero) {
    ss << millis.count() << "ms:";
    evenZero = true;
  }
  if (micros.count() || evenZero) {
    ss << micros.count() << "μs:";
    evenZero = true;
  }
  if (ns.count() || evenZero) {
    ss << ns.count() << "ns";
    evenZero = true;
  }

  if (!evenZero)
    ss << "N/A";

  return ss.str();
}


// ----------------------ACTUAL-PROCESSING------------------------------
int main() {
    /* vector size, number of iterations */
    const unsigned int vSize = 10000, itCount = 1000000;
    float sum;

    /* std */
    std::vector<float> v(vSize), u(vSize);

    /* armadillo */
    arma::fvec av(vSize, arma::fill::randu), au(vSize, arma::fill::randu);
    unsigned int i, j;

    /* blaze */
    blaze::StaticVector<float, vSize, blaze::columnVector> bv, bu;

    /* SSE2 VCL (my processor cannot do better than SSE2) */
    unsigned int sse2VSize = vSize / 4, remainder = vSize % 4;

    std::vector<Vec4f> vclv(sse2VSize), vclu(sse2VSize);
    std::vector<float> vclrv(remainder), vclru(remainder); /* remainder elements */

    std::cout << "----------------ELEMENT-WISE-ADDITION-----------------\n";
    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < vSize; ++j)
                v[j] += u[j];
    BENCHMARK_END("standard vector addition", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            av += au;
    BENCHMARK_END("armadillo vector addition", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            bv += bu;
    BENCHMARK_END("blaze vector addition", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < sse2VSize; ++j)
                vclv[j] += vclu[j];

        if (remainder) {
            for (i = 0; i < itCount; ++i)
                for (j = 0; j < remainder; ++j)
                    vclrv[j] += vclru[j];
        }
    BENCHMARK_END("VCL(SSE2 - 4floats) vector addition", itCount)


    std::cout << "-------------ELEMENT-WISE-MULTIPLICATION--------------\n";
    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < vSize; ++j)
                v[j] *= u[j];
    BENCHMARK_END("standard vector element wise multiplication", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            av %= au;
    BENCHMARK_END("armadillo vector element wise multiplication", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            bv *= bu;
    BENCHMARK_END("blaze vector element wise multiplication", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < sse2VSize; ++j)
                vclv[j] *= vclu[j];

        if (remainder) {
            for (i = 0; i < itCount; ++i)
                for (j = 0; j < remainder; ++j)
                    vclrv[j] *= vclru[j];
        }
    BENCHMARK_END("VCL(SSE2 - 4floats) element wise multiplication", itCount)


    std::cout << "------------SCALAR / DOT / INNER PRODUCT-------------\n";
    BENCHMARK_START
        sum = 0;
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < vSize; ++j)
                sum += (v[j] * u[j]);

        /* force the compiler to not remove 
         * the calculation by using the 'sum' 
         * variable */
        std::cout << sum << "\n";
    BENCHMARK_END("standard vector scalar / dot / inner product", itCount)

    BENCHMARK_START
        auto aut = au.t(); /* transposed */
        sum = 0;
        for (i = 0; i < itCount; ++i)
            sum += as_scalar(aut * av); /* (1 x n) * (n x 1) yields a 1 x 1 result */
        std::cout << sum << "\n";
    BENCHMARK_END("armadillo vector scalar / dot / inner product", itCount)

    BENCHMARK_START
        blaze::StaticVector<float, vSize, blaze::rowVector> but = blaze::trans(bu); /* transposed */
        sum = 0;
        for (i = 0; i < itCount; ++i)
            sum += blaze::dotu(bv, but);
        std::cout << sum << "\n";
    BENCHMARK_END("blaze vector scalar / dot / inner product", itCount)

    BENCHMARK_START
        sum = 0;
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < sse2VSize; ++j)
                sum += horizontal_add((vclv[j] * vclu[j]));

        if (remainder) {
            for (i = 0; i < itCount; ++i)
                for (j = 0; j < remainder; ++j)
                    sum += (vclrv[j] * vclru[j]);
        }
        std::cout << sum << "\n";
    BENCHMARK_END("VCL(SSE2 - 4floats) scalar / dot / inner product", itCount)

    return 0;
} 
EDIT

Oprofile data:

Element-wise addition
armadillo: 99.6462% -> void arma::arrayops::inplace_plus_base<float>(float*, float const*, unsigned long long)
blaze: 99.7094% -> main (meaning all blaze functions are inlined and it never goes into openblas)

Element-wise multiplication
armadillo: 99.6596% -> void arma::arrayops::inplace_mul_base<float>(float*, float const*, unsigned long long)
blaze: 99.6650% -> main (same as above)

Scalar / dot / inner product
armadillo: 84.3580% -> /usr/lib/libopenblasp-r0.2.18.so
blaze: 84.8001% -> /usr/lib/libopenblasp-r0.2.18.so

So Armadillo and Blaze use BLAS only for the dot product.
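
Given that, one way to isolate the wrapper overhead is to benchmark OpenBLAS directly through its standard CBLAS interface; a minimal sketch (an addition for illustration, assuming cblas.h from OpenBLAS and linking with -lopenblas):

#include <cblas.h>
#include <iostream>
#include <vector>

int main() {
    const int n = 10000;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    /* cblas_sdot computes the sum over i of x[i] * y[i]; this is
     * the routine the profiles above show Armadillo/Blaze ending up in */
    float dot = cblas_sdot(n, x.data(), 1, y.data(), 1);
    std::cout << dot << "\n"; /* prints 20000 */
    return 0;
}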

Comments:

user268396: You can expect some variance between sample runs, so benchmarking a single run does not provide much insight. Secondly, you are essentially benchmarking BLAS/OpenBLAS rather than Armadillo or Blaze: both should defer to BLAS for the actual computation. That is useful if you want to understand/profile the overhead, but not if you want to compare raw linear-algebra performance against a naive implementation. Note: Armadillo also lets you swap in other backends, which might be interesting. Third: have you considered opportunities for parallelization?

harold: The long dot-product time suggests you are not using enough accumulators. The dot product has an annoying dependency structure, so you need many accumulators to hide the FMA latency (or, without FMA, the add latency). The element-wise operations are hard to improve; you could try memory tricks (streaming, prefetching, TLB priming, etc.).

OP: @user268396 I took some measurements, but it does not change the conclusions. On your second observation, I added oprofile data showing that Armadillo and Blaze use BLAS only for the dot product. I will try swapping in ATLAS as the dot-product backend. On your third point, I use parallelization wherever the algorithm-level bottlenecks in my library allow it.

OP: @harold I tried 4 accumulators for the dot product, but it did not help. The vectors are 10k elements; I guess that is the reason. I will try a blocked algorithm to exploit the cache-line size (though in my lib that will not help for this toy example). Thanks for the ideas.

harold: 10k is not that big; it should easily fit in the LLC. And 4 accumulators is still not many: on Haswell, for example, you would need 10 (FMA latency of 5, throughput of 2 per cycle).
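
To illustrate the accumulator point from the comments, a minimal VCL sketch of a dot product with eight independent partial sums (the function and the count of eight are illustrative; the right count depends on the target CPU's FMA/add latency and throughput):

#include <vcl/vectorclass.h>
#include <cstddef>

/* Eight independent accumulators break the single dependency chain
 * of a naive dot product, letting the multiply-adds of different
 * accumulators overlap in the pipeline. Assumes n is a multiple
 * of 32 (8 accumulators x 4 floats) for brevity. */
float dot8(const float* a, const float* b, std::size_t n) {
    Vec4f acc[8];
    for (int k = 0; k < 8; ++k)
        acc[k] = Vec4f(0.0f);

    Vec4f va, vb;
    for (std::size_t j = 0; j < n; j += 32) {
        for (int k = 0; k < 8; ++k) {
            va.load(a + j + 4 * k);
            vb.load(b + j + 4 * k);
            /* mul_add uses FMA when available, mul+add otherwise */
            acc[k] = mul_add(va, vb, acc[k]);
        }
    }

    Vec4f total = (acc[0] + acc[1]) + (acc[2] + acc[3])
                + (acc[4] + acc[5]) + (acc[6] + acc[7]);
    return horizontal_add(total);
}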