C++: more aggressive optimization for FMA operations

c++ gcc clang

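I want to build a data type that represents multiple (say N) values of an arithmetic type and provides the same interface as an arithmetic type through operator overloading, so that I end up with a data type like Agner Fog's vector classes.

Consider this example: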

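#include <array>
using std::size_t;

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i)
            result[i] = a * x[i];
        return result;
    }
    // ... remaining operators, such as operator+= ...
};

I did the following and was able to get some very good results with gcc 10.2, using the same -Ofast -march=skylake -ffast-math flags as the Godbolt link: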

friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
    LoopSIMD result;
    std::transform(x.cbegin(), x.cend(), result.begin(),
                   [a](auto const& i) { return a * i; });
    return result;
}

LoopSIMD& operator+=(const LoopSIMD& x) {
    std::transform(this->cbegin(), this->cend(), x.cbegin(), this->begin(),
                   [](auto const& a, auto const& b) { return a + b; });
    return *this;
}

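std::transform has some crazy overloads, so I think I should explain.
The first overload captures a, multiplies each value by it, and stores the results starting at the beginning of result.
The second overload acts as a zip: it adds the corresponding values from x and this and stores the results back into this.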
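
To make the two overloads concrete, here is a minimal sketch on plain std::array values. The helper names scale and add_in_place are made up for this illustration and are not part of the original class.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

// Unary overload: one input range, one output range.
// Writes a * x[i] into the returned array.
template <class T, std::size_t S>
std::array<T, S> scale(T a, const std::array<T, S>& x) {
    std::array<T, S> result{};
    std::transform(x.cbegin(), x.cend(), result.begin(),
                   [a](const T& xi) { return a * xi; });
    return result;
}

// Binary overload: two input ranges are walked in lockstep ("zip").
// Writes acc[i] + x[i] back into acc.
template <class T, std::size_t S>
void add_in_place(std::array<T, S>& acc, const std::array<T, S>& x) {
    std::transform(acc.cbegin(), acc.cend(), x.cbegin(), acc.begin(),
                   [](const T& l, const T& r) { return l + r; });
}
```

Calling add_in_place(x, scale(a, y)) then has the same effect as x += a * y on the class.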

If you are not married to operator+= and operator*, you can create your own fma like this:

    LoopSIMD& fma(const LoopSIMD& x, double a ){
        std::transform_inclusive_scan(
            x.cbegin(),
            x.cend(),
            this->begin(),
            std::plus{},
            [a](auto const& i){return i * a;},
            0.0);
        return *this;
    }

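This requires C++17, and the loop compiles down to a fused multiply-add instruction. Note that vfmadd231sd is the scalar form, and that an inclusive scan stores the running sum of a * x[i] rather than this[i] + a * x[i], so this is not a drop-in replacement for x += a * y: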
foo(double, LoopSIMD<double, 40ul>&, LoopSIMD<double, 40ul> const&):
        xor     eax, eax
        vxorpd  xmm1, xmm1, xmm1
.L2:
        vfmadd231sd     xmm1, xmm0, QWORD PTR [rsi+rax]
        vmovsd  QWORD PTR [rdi+rax], xmm1
        add     rax, 8
        cmp     rax, 320
        jne     .L2
        ret
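
An inclusive scan produces running totals, which is why the assembly above carries xmm1 from one iteration to the next. The sketch below contrasts that with the element-wise update that x += a * y performs. The names scaled_prefix_sum and axpy are made up for this illustration.

```cpp
#include <array>
#include <cstddef>
#include <functional>
#include <numeric>

// Same call shape as the fma member above, on plain arrays:
// out[i] = 0.0 + a*x[0] + ... + a*x[i]  (a running sum).
template <std::size_t S>
std::array<double, S> scaled_prefix_sum(const std::array<double, S>& x, double a) {
    std::array<double, S> out{};
    std::transform_inclusive_scan(x.cbegin(), x.cend(), out.begin(),
                                  std::plus<>{},
                                  [a](double xi) { return xi * a; },
                                  0.0);
    return out;
}

// Element-wise multiply-add, which is what x += a * y computes:
// acc[i] = acc[i] + a * x[i].
template <std::size_t S>
void axpy(std::array<double, S>& acc, const std::array<double, S>& x, double a) {
    for (std::size_t i = 0; i < S; ++i) acc[i] += a * x[i];
}
```

The two functions agree only in special cases, so the scan version should be checked against the intended semantics before it replaces the operators.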

You can also simply create your own fma function:
template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD fma(const LoopSIMD& x, const T y, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(x[i], y, z[i]);
        }
        return result;
    }
    friend LoopSIMD fma(const T y, const LoopSIMD& x, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(y, x[i], z[i]);
        }
        return result;
    }
    // And more variants, taking `const LoopSIMD&, const LoopSIMD&, const T`, `const LoopSIMD&, const T, const T`, etc
};

SIMD foo(double a, SIMD x, SIMD y){
    return fma(a, y, x);
}
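
The snippet above omits the includes and the SIMD typedef. A self-contained sketch might look like the following; the lane count N = 8 and the using declaration that re-exports operator[] are assumptions added for illustration, and only one of the fma overloads is shown.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

template <class T, std::size_t S>
class LoopSIMD : std::array<T, S> {
public:
    // Assumption: re-export element access so callers can read lanes.
    using std::array<T, S>::operator[];

    // result[i] = y * x[i] + z[i], one rounding per lane via std::fma.
    friend LoopSIMD fma(const T y, const LoopSIMD& x, const LoopSIMD& z) {
        LoopSIMD result;
        for (std::size_t i = 0; i < S; ++i) {
            result[i] = std::fma(y, x[i], z[i]);
        }
        return result;
    }
};

constexpr std::size_t N = 8;       // assumed lane count
using SIMD = LoopSIMD<double, N>;

SIMD foo(double a, const SIMD& x, const SIMD& y) {
    return fma(a, y, x);  // x + a * y, found through argument-dependent lookup
}
```

Because std::fma asks for the fused operation explicitly, the result does not depend on the compiler being allowed to contract a separate multiply and add. Whether it becomes a hardware instruction still depends on the target flags, such as -march=skylake.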


However, to get better optimization in the first place, you should align the array. Your original code optimizes well if you do the following:
constexpr size_t next_power_of_2_not_less_than(size_t n) {
    size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    // operators
} __attribute__((aligned(next_power_of_2_not_less_than(sizeof(T[S])))));

// Or with a c++11 attribute
/*
template<class T, size_t S>
class [[gnu::aligned(next_power_of_2_not_less_than(sizeof(T[S])))]] LoopSIMD : std::array<T,S>
{
public:
    // operators
};
*/

SIMD foo(double a, SIMD x, SIMD y){
    x += a * y;
    return x;
}
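
For reference, a compilable sketch of the aligned class with the operators filled in. It swaps in the standard alignas specifier in place of the GNU attribute. The lane count of 8 and the re-exported operator[] are likewise assumptions of this sketch, not part of the answer above.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t next_power_of_2_not_less_than(std::size_t n) {
    std::size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

// Assumption: alignas is used here as the standard spelling of the
// aligned attribute.
template <class T, std::size_t S>
class alignas(next_power_of_2_not_less_than(sizeof(T[S]))) LoopSIMD
    : std::array<T, S> {
public:
    using std::array<T, S>::operator[];  // added so lanes can be inspected

    friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
        LoopSIMD result;
        for (std::size_t i = 0; i < S; ++i) result[i] = a * x[i];
        return result;
    }

    LoopSIMD& operator+=(const LoopSIMD& x) {
        for (std::size_t i = 0; i < S; ++i) (*this)[i] += x[i];
        return *this;
    }
};

using SIMD = LoopSIMD<double, 8>;  // assumed lane count
static_assert(sizeof(SIMD) == 64, "8 doubles occupy 64 bytes");
static_assert(alignof(SIMD) == 64, "alignment is rounded up to the object size");

SIMD foo(double a, SIMD x, const SIMD& y) {
    x += a * y;
    return x;
}
```

The static_assert lines document the intent: every object starts on a boundary at least as large as the object itself.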

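Thanks! But there is a mistake in your code. The std::transform call in operator* must be std::transform(x.begin(), x.end(), result.begin(), ...);. Then I get the same result as with my example.

Ah, nice. I also forgot to add the cbegin and cend iterators. That does seem to improve things a little, but the core functionality doesn't seem to change much. It may be possible to improve things with lazy evaluation, functional programming, and intrinsics, but I think that would take more work.

Don't forget to make y a const ref; you will see a considerable reduction in foo as well. Maybe making a copy of x and passing it as a const ref would help too!

I know how to solve this with C++ techniques. I think one alternative is an expression template that is returned by operator* and then applies the fma operation in operator+=. But my goal is more about the compiler optimization in the question, because we already have running code that I don't want to change.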
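
The expression-template idea from the last comment could be sketched roughly as follows. ScaledSIMD is a hypothetical proxy type invented for this illustration, and the lane count of 8 and the re-exported operator[] are assumptions too.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

template <class T, std::size_t S>
class LoopSIMD;

// Hypothetical proxy: remembers the operands of `a * x` without computing it.
template <class T, std::size_t S>
struct ScaledSIMD {
    T a;
    const LoopSIMD<T, S>& x;
};

template <class T, std::size_t S>
class LoopSIMD : std::array<T, S> {
public:
    using std::array<T, S>::operator[];  // added so lanes can be inspected

    // operator* only records its operands.
    friend ScaledSIMD<T, S> operator*(const T a, const LoopSIMD& x) {
        return {a, x};
    }

    // operator+= sees the whole expression and fuses it lane by lane.
    LoopSIMD& operator+=(const ScaledSIMD<T, S>& e) {
        for (std::size_t i = 0; i < S; ++i) {
            (*this)[i] = std::fma(e.a, e.x[i], (*this)[i]);
        }
        return *this;
    }
};

using SIMD = LoopSIMD<double, 8>;  // assumed lane count

SIMD foo(double a, SIMD x, const SIMD& y) {
    x += a * y;  // builds a ScaledSIMD, then runs a single fma loop
    return x;
}
```

The call site x += a * y stays unchanged, which is the appeal of this approach. The proxy holds a reference, so it must not outlive the operand it refers to, for example by being stored in an auto variable.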