Why doesn't g++ fully optimize these loops/operator calls?

c++ assembly optimization g++ inlining

Consider this struct, which could, for example, represent a pair of 4D vectors:

#include <algorithm> // std::fill_n
#include <iostream>  // std::ostream, std::cout

struct A {
    double x[4];
    double y[4];

    A() : A(0.0, 0.0) { }
    A(double xp, double yp)
    {
        std::fill_n(x, 4, xp);
        std::fill_n(y, 4, yp);
    }

    // Simple element-wise delegation of the mathematical operations
    friend A operator+(const A &l, const A &r) 
    {
        A res;
        for (int i = 0; i < 4; i++)
        {
            res.x[i] = l.x[i] + r.x[i];
            res.y[i] = l.y[i] + r.y[i];
        }
        return res;
    }
    friend A operator*(const A &l, const double &r) 
    {
        A res;
        for (int i = 0; i < 4; i++)
        {
            res.x[i] = l.x[i] * r;
            res.y[i] = l.y[i] * r;
        }
        return res;
    }
    friend A operator*(const double &l, const A &r) 
    {
        A res;
        for (int i = 0; i < 4; i++)
        {
            res.x[i] = l * r.x[i];
            res.y[i] = l * r.y[i];
        }
        return res;
    }
    friend std::ostream &operator<<(std::ostream &stream, const A &a)
    {
        for (int i = 0; i < 4; i++)
            stream << "(" << a.x[i] << "|" << a.y[i] << ") "; // write to the stream that was passed in, not always std::cout
        return stream;
    }
};

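and a struct B that uses it. B comes in two versions: version 1 uses the operators defined on A, version 2 writes out the element-wise loops by hand. Both versions and a small main are listed at the end of this post.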

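I have prepared an example of both cases on godbolt. Both were compiled with g++ 7.1.0 at optimization level -O3. The left pane corresponds to version 1 of struct B, the right pane to version 2.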

As you can see in the disassembly, for version 1 the compiler generates two labels that correspond to the mathX functions in struct B:

B::mathA(int, double):

[…]

B::mathB():

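As my profiling shows, the first example is much slower than the second. In my actual code these functions are called more than a billion times, so they contribute significantly to the overall runtime. I assume this is partly due to the jumps to the function definitions.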
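For readers who want to reproduce such a comparison, the following is a minimal timing sketch, not the profiling setup used for this post. The helper name `time_calls` and its parameters are hypothetical; it only relies on `std::chrono::steady_clock` from the standard library.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper (not from the original post): calls body(i) for
// i = 0 .. iterations-1 and returns the elapsed wall-clock time in seconds.
template <typename F>
double time_calls(std::int64_t iterations, F &&body)
{
    const auto start = std::chrono::steady_clock::now();
    for (std::int64_t i = 0; i < iterations; ++i)
        body(i);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

With the struct B from this post, the body could call `b.mathA(static_cast<int>(i % 4), 0.9)` followed by `b.mathB()`. Copying one result element into a `volatile` variable afterwards helps keep the optimizer from discarding the work, and the numbers are only meaningful when both versions are built with the same flags.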

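Is there a way to force the compiler to generate the same assembly as in the second example, i.e. while still using the operator definitions?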

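Update: Since the compiler appears to generate labels and jumps for mathX(…), my idea was to try to inline these functions. Using the inline keyword changes nothing, but g++ offers __attribute__((always_inline)), which forces the compiler to inline the function (see version 3 at the end of this post).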

This improved performance, which now lies between version 1 and version 2. It is still not perfect, but I will use this solution if no better one turns up.

I tried version 1 and version 2 at -O2, and clang seems to generate exactly the same assembly for both, with the calls inlined.

struct B { // version 1
    double f1; 
    double f2; // Two coefficients
    A buff1;
    A buff2;
    A buffa[4]; // Objects of struct A
    // The following functions use the operators defined on struct A
    void mathA(int i, double d) // Some math operations
    {
        buff2 = buff1 + buffa[i] * d;
    }
    void mathB() // Some more math (vector) operations
    {
        buff1 = f1 * (buffa[0] + buffa[3]) + f2 * (buffa[1] + buffa[2]);
    }
};

struct B { // version 2
    double f1; 
    double f2; // Two coefficients
    A buff1;
    A buff2;
    A buffa[4]; // Objects of struct A
    // The following functions DO NOT use the operators defined on struct A
    void mathA(int i, double d) // Some math operations
    {
        for (int j = 0; j < 4; j++)
        {
            buff2.x[j] = buff1.x[j] + buffa[i].x[j] * d;
            buff2.y[j] = buff1.y[j] + buffa[i].y[j] * d;
        }
    }
    void mathB() // Some more math (vector) operations
    {
        for (int j = 0; j < 4; j++)
        {
            buff1.x[j] = f1 * (buffa[0].x[j] + buffa[3].x[j]) + f2 * (buffa[1].x[j] + buffa[2].x[j]);
            buff1.y[j] = f1 * (buffa[0].y[j] + buffa[3].y[j]) + f2 * (buffa[1].y[j] + buffa[2].y[j]);
        }
    }
};

int main(int argc, char **argv)
{
    B b;
    b.f1 = 0.5;
    b.f2 = 0.8;
    b.buff1 = A(0.7, 0.8);
    b.buff2 = A(1.7, 2.8);

    b.mathA(1, 0.9);
    b.mathB();

    std::cout << b.buff1 << "\n" << b.buff2;
}
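Before comparing speed, it can help to confirm that version 1 and version 2 compute the same values. The sketch below puts both into one translation unit under the hypothetical names `B1` and `B2`, with a hypothetical shared base `Data`. The operators are written as free functions rather than friends, and the `buffa` values are made up for illustration. The comparison uses a small tolerance because a compiler may contract `a + b * c` into a fused multiply-add in one version and not the other.

```cpp
#include <algorithm>
#include <cmath>

struct A {
    double x[4];
    double y[4];
    A() : A(0.0, 0.0) { }
    A(double xp, double yp) { std::fill_n(x, 4, xp); std::fill_n(y, 4, yp); }
};

inline A operator+(const A &l, const A &r)
{
    A res;
    for (int i = 0; i < 4; i++) { res.x[i] = l.x[i] + r.x[i]; res.y[i] = l.y[i] + r.y[i]; }
    return res;
}
inline A operator*(const A &l, double r)
{
    A res;
    for (int i = 0; i < 4; i++) { res.x[i] = l.x[i] * r; res.y[i] = l.y[i] * r; }
    return res;
}
inline A operator*(double l, const A &r) { return r * l; }

// Hypothetical shared layout so both versions hold the same members.
struct Data {
    double f1, f2;
    A buff1, buff2, buffa[4];
};

struct B1 : Data { // version 1: uses the operators on A
    void mathA(int i, double d) { buff2 = buff1 + buffa[i] * d; }
    void mathB() { buff1 = f1 * (buffa[0] + buffa[3]) + f2 * (buffa[1] + buffa[2]); }
};

struct B2 : Data { // version 2: hand-written loops
    void mathA(int i, double d)
    {
        for (int j = 0; j < 4; j++) {
            buff2.x[j] = buff1.x[j] + buffa[i].x[j] * d;
            buff2.y[j] = buff1.y[j] + buffa[i].y[j] * d;
        }
    }
    void mathB()
    {
        for (int j = 0; j < 4; j++) {
            buff1.x[j] = f1 * (buffa[0].x[j] + buffa[3].x[j]) + f2 * (buffa[1].x[j] + buffa[2].x[j]);
            buff1.y[j] = f1 * (buffa[0].y[j] + buffa[3].y[j]) + f2 * (buffa[1].y[j] + buffa[2].y[j]);
        }
    }
};

// Element-wise comparison with an absolute tolerance.
inline bool nearly_equal(const A &a, const A &b, double tol = 1e-12)
{
    for (int i = 0; i < 4; i++)
        if (std::fabs(a.x[i] - b.x[i]) > tol || std::fabs(a.y[i] - b.y[i]) > tol)
            return false;
    return true;
}

// Same setup as the main above, plus illustrative non-zero buffa values.
template <typename B>
B make_and_run()
{
    B b;
    b.f1 = 0.5;
    b.f2 = 0.8;
    b.buff1 = A(0.7, 0.8);
    b.buff2 = A(1.7, 2.8);
    for (int k = 0; k < 4; k++)
        b.buffa[k] = A(0.1 * (k + 1), 0.2 * (k + 1));
    b.mathA(1, 0.9);
    b.mathB();
    return b;
}

inline bool versions_agree()
{
    const B1 b1 = make_and_run<B1>();
    const B2 b2 = make_and_run<B2>();
    return nearly_equal(b1.buff1, b2.buff1) && nearly_equal(b1.buff2, b2.buff2);
}
```

Calling `versions_agree()` from a test or from main gives a quick sanity check that any difference between the two versions is one of code generation only, not of results.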

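struct B { // version 3
    // …
    // The attribute goes before the declaration: g++ rejects attributes
    // placed after the declarator of a function definition.
    __attribute__((always_inline)) void mathA(int i, double d)
    {
        // …
    }
    __attribute__((always_inline)) void mathB()
    {
        // …
    }
};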
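A compilable sketch of the version 3 idea might look as follows. The macro name `FORCE_INLINE` and the function `run_version3` are hypothetical, the operators are written as free functions rather than friends, and only `mathA` and `mathB` carry the attribute, as in the post. Whether forcing inlining actually helps should be checked by measurement for each compiler and flag set.

```cpp
#include <algorithm>

// Hypothetical convenience macro: expands to the GCC/Clang attribute where
// available and falls back to plain inline elsewhere.
#if defined(__GNUC__) || defined(__clang__)
#define FORCE_INLINE inline __attribute__((always_inline))
#else
#define FORCE_INLINE inline
#endif

struct A {
    double x[4];
    double y[4];
    A() : A(0.0, 0.0) { }
    A(double xp, double yp) { std::fill_n(x, 4, xp); std::fill_n(y, 4, yp); }
};

inline A operator+(const A &l, const A &r)
{
    A res;
    for (int i = 0; i < 4; i++) { res.x[i] = l.x[i] + r.x[i]; res.y[i] = l.y[i] + r.y[i]; }
    return res;
}
inline A operator*(const A &l, double r)
{
    A res;
    for (int i = 0; i < 4; i++) { res.x[i] = l.x[i] * r; res.y[i] = l.y[i] * r; }
    return res;
}
inline A operator*(double l, const A &r) { return r * l; }

struct B { // version 3, sketched: operator-based bodies with forced inlining
    double f1;
    double f2;
    A buff1;
    A buff2;
    A buffa[4];

    FORCE_INLINE void mathA(int i, double d)
    {
        buff2 = buff1 + buffa[i] * d;
    }
    FORCE_INLINE void mathB()
    {
        buff1 = f1 * (buffa[0] + buffa[3]) + f2 * (buffa[1] + buffa[2]);
    }
};

// Same setup as the main in this post; buffa stays zero-initialized.
inline B run_version3()
{
    B b;
    b.f1 = 0.5;
    b.f2 = 0.8;
    b.buff1 = A(0.7, 0.8);
    b.buff2 = A(1.7, 2.8);
    b.mathA(1, 0.9);
    b.mathB();
    return b;
}
```

The attribute is placed in front of the declaration here because g++ does not accept it after the declarator of a function definition.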