Performance of vector/matrix/linear algebra libraries (C++)

Tags: c++, performance, optimization, linear-algebra, benchmarking

I am writing a machine learning library from scratch (on the CPU). It involves heavy operations on float vectors. My goal is to make the library efficient and fast. To that end I identified a list of core vector operations on which the performance of my machine learning library largely depends: element-wise addition, element-wise multiplication, dot product, and element-wise sqrt/pow/exp.

I know that fast floating-point vector operations call for SIMD, instruction pipelining, cache-line-friendly access, and relaxing the IEEE-754 floating-point checks (meaning dropping denormal handling, NaN/+0/-0 checks, and so on).
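
For reference, a minimal sketch of disabling denormal handling on x86 through the SSE control register (this snippet is an illustration, not part of the benchmark below; it assumes GCC/Clang with the SSE intrinsics headers, and the remaining IEEE checks are usually relaxed with compiler flags such as -ffast-math):

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE (FTZ bit) */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE (DAZ bit) */

int main() {
    /* FTZ: denormal results of computations are replaced with 0 */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    /* DAZ: denormal inputs are treated as 0 (needs a CPU with the DAZ bit) */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    /* ... run the float-heavy kernels here ... */
    return 0;
}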

There are many C++ libraries that provide the core vector operations I need, and they often rely on BLAS (or OpenBLAS) for the SIMD work. I have used Armadillo for quite some time without much benchmarking or comparison against other libraries. But after profiling my program I found that it spends only a negligible amount of time inside the OpenBLAS library. So I decided to do a real performance benchmark of the different library implementations and compare them against a naive implementation based on std::vector:


Vector size - 10000 floats, number of operations - 1000000

Element-wise addition

std::vector - 2s:477ms:183μs:102ns
armadillo - 2s:452ms:828μs:409ns
blaze - 2s:404ms:917μs:747ns
VCL(SSE2) - 1s:957ms:42μs:608ns
Element-wise multiplication

std::vector - 2s:438ms:7μs:792ns
armadillo - 2s:433ms:195μs:322ns
blaze - 2s:415ms:86μs:600ns
VCL(SSE2) - 1s:927ms:301μs:214ns
Scalar / dot / inner product

std::vector - 2s:681ms:335μs:990ns
armadillo - 2s:600ms:441μs:415ns
blaze - 2s:289ms:95μs:894ns
VCL(SSE2) - 3s:485ms:100μs:644ns
List of compared libraries

Armadillo (+OpenBLAS):

Blaze (+OpenBLAS):

VCL:


Surprisingly, the naive implementation is as fast as Armadillo and Blaze (the differences are negligible). VCL is about 23% faster than the other implementations for element-wise addition and multiplication. This is mostly due to GCC's PGO (profile-guided optimization); without it, its performance is the same as the rest.

So, my questions are:

  • The compiler already seems to optimize the naive implementation with SIMD and instruction pipelining. Can linear algebra libraries really do better on these core operations? (I can update the post with benchmarks of other libraries if needed; see the auto-vectorization sketch after this list.)

  • Is my benchmarking methodology suitable for the task? (code below)
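
On the first question, a minimal sketch (illustrative names, not from the benchmark below) of the kind of naive loop GCC auto-vectorizes at -O3; the __restrict__ qualifiers tell the compiler the arrays do not alias, so it can emit packed SIMD adds without runtime overlap checks:

#include <cstddef>

/* With e.g. g++ -O3 -march=native this compiles to packed SIMD
 * additions; without __restrict__ the compiler must either prove
 * dst and src do not overlap or insert a runtime check. */
void add_inplace(float* __restrict__ dst, const float* __restrict__ src, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j)
        dst[j] += src[j];
}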

Benchmark code - vecbench.cpp

#include <armadillo>
#include <blaze/Math.h>
#include <vcl/vectorclass.h>
#include <chrono>
#include <string>
#include <sstream>
#include <iostream>
#include <vector>

// -----------------------BENCHMARK-FUNCTIONS---------------------------
typedef std::chrono::duration<int, std::ratio<86400>> days;

#define BENCHMARK_START \
{   \
    std::chrono::high_resolution_clock::time_point t1 = std::chrono::high_resolution_clock::now();

#define BENCHMARK_END(name, itCount) \
    std::chrono::high_resolution_clock::time_point t2 = std::chrono::high_resolution_clock::now(); \
    std::cout << "Benchmark for " << name << ":\n  " << hreadableUnit(std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1)) \
                        << " total\n  " << hreadableUnit(std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1) / itCount) \
                        << " average per iteration\n\n"; \
}

/* Pretty-print a duration as [d:][h:][m:][s:][ms:][μs:]ns,
 * skipping units until the first non-zero one */
std::string hreadableUnit(std::chrono::nanoseconds ns) {
  auto d = std::chrono::duration_cast<days>(ns);
  ns -= d;
  auto h = std::chrono::duration_cast<std::chrono::hours>(ns);
  ns -= h;
  auto m = std::chrono::duration_cast<std::chrono::minutes>(ns);
  ns -= m;
  auto s = std::chrono::duration_cast<std::chrono::seconds>(ns);
  ns -= s;
  auto millis = std::chrono::duration_cast<std::chrono::milliseconds>(ns);
  ns -= millis;
  auto micros = std::chrono::duration_cast<std::chrono::microseconds>(ns);
  ns -= micros;

  std::stringstream ss;
  bool evenZero = false;
  if (d.count() || evenZero) {
    ss << d.count() << "d:";
    evenZero = true;
  }
  if (h.count() || evenZero) {
    ss << h.count() << "h:";
    evenZero = true;
  }
  if (m.count() || evenZero) {
    ss << m.count() << "m:";
    evenZero = true;
  }
  if (s.count() || evenZero) {
    ss << s.count() << "s:";
    evenZero = true;
  }
  if (millis.count() || evenZero) {
    ss << millis.count() << "ms:";
    evenZero = true;
  }
  if (micros.count() || evenZero) {
    ss << micros.count() << "μs:";
    evenZero = true;
  }
  if (ns.count() || evenZero) {
    ss << ns.count() << "ns";
    evenZero = true;
  }

  if (!evenZero)
    ss << "N/A";

  return ss.str();
}


// ----------------------ACTUAL-PROCESSING------------------------------
int main() {
    /* vector size, number of iterations */
    const unsigned int vSize = 10000, itCount = 1000000;
    float sum;

    /* std */
    std::vector<float> v(vSize), u(vSize);

    /* armadillo */
    arma::fvec av(vSize, arma::fill::randu), au(vSize, arma::fill::randu);
    unsigned int i, j;

    /* blaze */
    blaze::StaticVector<float, vSize, blaze::columnVector> bv, bu;

    /* SSE2 VCL (my processor cannot do better than SSE2) */
    unsigned int sse2VSize = vSize / 4, remainder = vSize % 4;

    std::vector<Vec4f> vclv(sse2VSize), vclu(sse2VSize);
    std::vector<float> vclrv(remainder), vclru(remainder); /* remainder elements */

    std::cout << "----------------ELEMENT-WISE-ADDITION-----------------\n";
    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < vSize; ++j)
                v[j] += u[j];
    BENCHMARK_END("standard vector addition", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            av += au;
    BENCHMARK_END("armadillo vector addition", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            bv += bu;
    BENCHMARK_END("blaze vector addition", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < sse2VSize; ++j)
                vclv[j] += vclu[j];

        if (remainder) {
            for (i = 0; i < itCount; ++i)
                for (j = 0; j < remainder; ++j)
                    vclrv[j] += vclru[j];
        }
    BENCHMARK_END("VCL(SSE2 - 4floats) vector addition", itCount)


    std::cout << "-------------ELEMENT-WISE-MULTIPLICATION--------------\n";
    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < vSize; ++j)
                v[j] *= u[j];
    BENCHMARK_END("standard vector element wise multiplication", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            av %= au;
    BENCHMARK_END("armadillo vector element wise multiplication", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            bv *= bu;
    BENCHMARK_END("blaze vector element wise multiplication", itCount)

    BENCHMARK_START
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < sse2VSize; ++j)
                vclv[j] *= vclu[j];

        if (remainder) {
            for (i = 0; i < itCount; ++i)
                for (j = 0; j < remainder; ++j)
                    vclrv[j] *= vclru[j];
        }
    BENCHMARK_END("VCL(SSE2 - 4floats) element wise multiplication", itCount)


    std::cout << "------------SCALAR / DOT / INNER PRODUCT-------------\n";
    BENCHMARK_START
        sum = 0;
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < vSize; ++j)
                sum += (v[j] * u[j]);

        /* force the compiler to not remove 
         * the calculation by using the 'sum' 
         * variable */
        std::cout << sum << "\n";
    BENCHMARK_END("standard vector scalar / dot / inner product", itCount)

    BENCHMARK_START
        auto aut = au.t(); /* transposed */
        sum = 0;
        for (i = 0; i < itCount; ++i)
            sum += as_scalar(aut * av); /* (1 x n) * (n x 1) yields a 1 x 1 result */
        std::cout << sum << "\n";
    BENCHMARK_END("armadillo vector scalar / dot / inner product", itCount)

    BENCHMARK_START
        blaze::StaticVector<float, vSize, blaze::rowVector> but = blaze::trans(bu); /* transposed */
        sum = 0;
        for (i = 0; i < itCount; ++i)
            sum += blaze::dotu(bv, but);
        std::cout << sum << "\n";
    BENCHMARK_END("blaze vector scalar / dot / inner product", itCount)

    BENCHMARK_START
        sum = 0;
        for (i = 0; i < itCount; ++i)
            for (j = 0; j < sse2VSize; ++j)
                sum += horizontal_add((vclv[j] * vclu[j]));

        if (remainder) {
            for (i = 0; i < itCount; ++i)
                for (j = 0; j < remainder; ++j)
                    sum += (vclrv[j] * vclru[j]);
        }
        std::cout << sum << "\n";
    BENCHMARK_END("VCL(SSE2 - 4floats) scalar / dot / inner product", itCount)

    return 0;
} 
EDIT

Oprofile data:

Element-wise addition
armadillo: 99.6462% -> void arma::arrayops::inplace_plus_base<float>(float*, float const*, unsigned long long)
blaze: 99.7094% -> main (meaning all blaze functions are inlined and it never goes into openblas)

Element-wise multiplication
armadillo: 99.6596% -> void arma::arrayops::inplace_mul_base<float>(float*, float const*, unsigned long long)
blaze: 99.6650% -> main (same as above)

Scalar / dot / inner product
armadillo: 84.3580% -> /usr/lib/libopenblasp-r0.2.18.so
blaze: 84.8001% -> /usr/lib/libopenblasp-r0.2.18.so

So Armadillo and Blaze use BLAS only for the dot product.
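
Given that, one way to isolate the wrapper overhead is to benchmark OpenBLAS directly through its standard CBLAS interface; a minimal sketch (an addition for illustration, assuming cblas.h from OpenBLAS and linking with -lopenblas):

#include <cblas.h>
#include <iostream>
#include <vector>

int main() {
    const int n = 10000;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    /* cblas_sdot computes the sum over i of x[i] * y[i]; this is
     * the routine the profiles above show Armadillo/Blaze ending up in */
    float dot = cblas_sdot(n, x.data(), 1, y.data(), 1);
    std::cout << dot << "\n"; /* prints 20000 */
    return 0;
}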

Comments:

user268396: You can expect some variance between sample runs, so benchmarking a single run does not provide much insight. Secondly, you are essentially benchmarking BLAS/OpenBLAS rather than Armadillo or Blaze: both should defer to BLAS for the actual computation. That is useful if you want to understand/profile the overhead, but not if you want to compare raw linear-algebra performance against a naive implementation. Note: Armadillo also lets you swap in other backends, which might be interesting. Third: have you considered opportunities for parallelization?

harold: The long dot-product time suggests you are not using enough accumulators. The dot product has an annoying dependency structure, so you need many accumulators to hide the FMA latency (or, without FMA, the add latency). The element-wise operations are hard to improve; you could try memory tricks (streaming, prefetching, TLB priming, etc.).

OP: @user268396 I took some measurements, but it does not change the conclusions. On your second observation, I added oprofile data showing that Armadillo and Blaze use BLAS only for the dot product. I will try swapping in ATLAS as the dot-product backend. On your third point, I use parallelization wherever the algorithm-level bottlenecks in my library allow it.

OP: @harold I tried 4 accumulators for the dot product, but it did not help. The vectors are 10k elements; I guess that is the reason. I will try a blocked algorithm to exploit the cache-line size (though in my lib that will not help for this toy example). Thanks for the ideas.

harold: 10k is not that big; it should easily fit in the LLC. And 4 accumulators is still not many: on Haswell, for example, you would need 10 (FMA latency of 5, throughput of 2 per cycle).
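
To illustrate the accumulator point from the comments, a minimal VCL sketch of a dot product with eight independent partial sums (the function and the count of eight are illustrative; the right count depends on the target CPU's FMA/add latency and throughput):

#include <vcl/vectorclass.h>
#include <cstddef>

/* Eight independent accumulators break the single dependency chain
 * of a naive dot product, letting the multiply-adds of different
 * accumulators overlap in the pipeline. Assumes n is a multiple
 * of 32 (8 accumulators x 4 floats) for brevity. */
float dot8(const float* a, const float* b, std::size_t n) {
    Vec4f acc[8];
    for (int k = 0; k < 8; ++k)
        acc[k] = Vec4f(0.0f);

    Vec4f va, vb;
    for (std::size_t j = 0; j < n; j += 32) {
        for (int k = 0; k < 8; ++k) {
            va.load(a + j + 4 * k);
            vb.load(b + j + 4 * k);
            /* mul_add uses FMA when available, mul+add otherwise */
            acc[k] = mul_add(va, vb, acc[k]);
        }
    }

    Vec4f total = (acc[0] + acc[1]) + (acc[2] + acc[3])
                + (acc[4] + acc[5]) + (acc[6] + acc[7]);
    return horizontal_add(total);
}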