Why won't GCC auto-vectorize this loop?

c gcc


I have the following C program (a simplification of my actual use case that shows the same behavior).

I get the output

main.c:10: note: not vectorized: unhandled data-ref

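where 10 is the line of the inner for loop. When I looked up why it might say this, it seemed to be saying that the pointers could alias, but they cannot in my code, because I have the __restrict keyword. They also suggested adding the -msse flags, but those don't seem to do anything either. Any help?

Try: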

const float * __restrict__ input = ...;
float * __restrict__ output = ...;

A bit of experimenting by changing things around:

#include <stdlib.h>
#include <math.h>

int main(int argc, char ** argv) {

    const float * __restrict__ input = new float[20000];
    float * __restrict__  output = new float[20000];

    unsigned int pos=0;
    while(1) {
        unsigned int rest=100;
        output += pos;
        input += pos;
        for(unsigned int i=0;i<rest; ++i) {
            output[i] = input[i] * 0.1;
        }

        pos+=rest;
        if(pos>10000) {
            break;
        }
    }
}

g++ -O3 -g -Wall -ftree-vectorizer-verbose=7 -msse -msse2 -msse3 -c test.cpp

test.cpp:14: note: versioning for alias required: can't determine dependence between *D.4096_24 and *D.4095_21
test.cpp:14: note: mark for run-time aliasing test between *D.4096_24 and *D.4095_21
test.cpp:14: note: Alignment of access forced using versioning.
test.cpp:14: note: Vectorizing an unaligned access.
test.cpp:14: note: vect_model_load_cost: unaligned supported by hardware.
test.cpp:14: note: vect_model_load_cost: inside_cost = 2, outside_cost = 0 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 0 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 2, outside_cost = 1 .
test.cpp:14: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 0 .
test.cpp:14: note: vect_model_store_cost: inside_cost = 1, outside_cost = 0 .
test.cpp:14: note: cost model: Adding cost of checks for loop versioning to treat misalignment.

test.cpp:14: note: cost model: Adding cost of checks for loop versioning aliasing.

test.cpp:14: note: Cost model analysis:
  Vector inside of loop cost: 8
  Vector outside of loop cost: 6
  Scalar iteration cost: 5
  Scalar outside cost: 1
  prologue iterations: 0
  epilogue iterations: 0
  Calculated minimum iters for profitability: 2

test.cpp:14: note:   Profitability threshold = 3

test.cpp:14: note: Vectorization may not be profitable.
test.cpp:14: note: create runtime check for data references *D.4096_24 and *D.4095_21
test.cpp:14: note: created 1 versioning for alias checks.

test.cpp:14: note: LOOP VECTORIZED.
test.cpp:4: note: vectorized 1 loops in function.

Compilation finished at Wed Feb 16 19:17:59
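
The "versioning for alias" and "run-time aliasing test" notes in the log mean the compiler keeps two copies of the loop and picks one at run time after checking whether the two memory ranges overlap. Written out by hand, the idea looks roughly like the sketch below. The names ranges_disjoint and scale_versioned are hypothetical, and the overlap test is only an illustration of the concept, not the check GCC literally emits.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if the n-float ranges starting at a and b do not overlap. */
static int ranges_disjoint(const float *a, const float *b, size_t n)
{
    uintptr_t pa = (uintptr_t)a, pb = (uintptr_t)b;
    uintptr_t bytes = n * sizeof(float);
    return pa + bytes <= pb || pb + bytes <= pa;
}

/* Hypothetical hand-written form of loop versioning.
 * Returns 1 if the no-alias copy of the loop ran, 0 if the fallback ran. */
int scale_versioned(const float *input, float *output, size_t n)
{
    if (ranges_disjoint(input, output, n)) {
        /* Ranges proven disjoint: this copy may be vectorized freely. */
        const float *restrict in = input;
        float *restrict out = output;
        for (size_t i = 0; i < n; i++)
            out[i] = in[i] * 0.1f;
        return 1;
    }
    /* Possible overlap: keep strict element-by-element order. */
    for (size_t i = 0; i < n; i++)
        output[i] = input[i] * 0.1f;
    return 0;
}
```

Calling scale_versioned with two separate arrays takes the first branch; calling it with output pointing into the middle of input takes the fallback.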


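It doesn't like the outer loop format, which stops it from understanding the inner loop. If I collapse everything into a single loop, I can get it to vectorize: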

#include <stdlib.h>
#include <math.h>
int main(int argc, char ** argv) {
    const float * __restrict__ input = malloc(20000*sizeof(float));
    float * __restrict__ output = malloc(20000*sizeof(float));

    for(unsigned int i=0; i<=10100; i++) {
            output[i] = input[i] * 0.1f;
    }
}

This certainly looks like a bug. In the following equivalent functions, foo() is vectorized but bar() is not when compiling for an x86-64 target:

void foo(const float * restrict input, float * restrict output)
{
    unsigned int pos;
    for (pos = 0; pos < 10100; pos++)
        output[pos] = input[pos] * 0.1;
}

void bar(const float * restrict input, float * restrict output)
{
    unsigned int pos;
    unsigned int i;
    for (pos = 0; pos <= 10000; pos += 100)
        for (i = 0; i < 100; i++)
            output[pos + i] = input[pos + i] * 0.1;
}
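
To check that the two loop shapes really do touch the same elements and produce the same results, a small harness along these lines can help. This is only a sketch: scale_flat and scale_blocked are hypothetical copies of foo() and bar() above, and variants_agree is an illustrative helper, not part of the original answer.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

#define N 10100u  /* number of elements both loop shapes write */

/* Same shape as foo(): one flat loop. */
static void scale_flat(const float *restrict input, float *restrict output)
{
    unsigned int pos;
    for (pos = 0; pos < N; pos++)
        output[pos] = input[pos] * 0.1;
}

/* Same shape as bar(): 101 blocks of 100 elements each. */
static void scale_blocked(const float *restrict input, float *restrict output)
{
    unsigned int pos, i;
    for (pos = 0; pos <= 10000; pos += 100)
        for (i = 0; i < 100; i++)
            output[pos + i] = input[pos + i] * 0.1;
}

/* Returns 1 if both variants write identical bytes, 0 if they differ,
 * and -1 if memory allocation failed. */
int variants_agree(void)
{
    float *in = malloc(N * sizeof *in);
    float *a = malloc(N * sizeof *a);
    float *b = malloc(N * sizeof *b);
    int same = -1;

    if (in && a && b) {
        for (unsigned int k = 0; k < N; k++)
            in[k] = (float)k;
        scale_flat(in, a);
        scale_blocked(in, b);
        same = memcmp(a, b, N * sizeof *a) == 0;
    }
    free(in);
    free(a);
    free(b);
    return same;
}
```

Since both variants compute the same thing, any difference in the vectorizer's report between them comes from how the loops are written, not from what they do.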

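What version of gcc? A working example might also be useful, because when I tried this with version 4.4.5 it was vectorized.

Could you post example code that compiles? When I fill in some dummy values, the loop gets vectorized...

I've updated the example so that it compiles. It still doesn't vectorize. I'm using "gcc (Debian 4.4.5-10) 4.4.5".

What's the rationale for this?

@Oli Just a guess: maybe his compiler doesn't like the extra const or the __restrict form.

@Jeremy See my update: it doesn't like the runtime start bound. I guess it thinks pos might alias.

Thanks! I am on a 64-bit platform! Using -m32 makes it work fine. I'm filing a bug report now. The other answers are good, but they are really just workarounds, since this should work without modification.

Note that a 32-bit executable may be considerably slower than a non-vectorized 64-bit one, so unless your goal is purely to "use SSE", you should profile your whole application.

Thanks Ben, I'm not actually using it to compile my code, only to file the bug report. I was able to get it to vectorize correctly on 64-bit by rearranging things a little.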
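
The last comment mentions a rearrangement that vectorizes on 64-bit but does not show it. One plausible shape, offered as a guess rather than the poster's actual fix, is to compute per-block base pointers outside the inner loop so the inner index starts at zero, and to use size_t indices. This assumes the obstacle is the 32-bit unsigned pos + i address arithmetic on a 64-bit target, which the thread does not confirm. The function scale_blocks and its parameters are hypothetical, and whether a given GCC version vectorizes it should be checked against the vectorizer's own report.

```c
#include <stddef.h>

/* Hypothetical helper: scales nblocks * block_len floats by 0.1f,
 * processing one block at a time. */
void scale_blocks(const float *restrict input, float *restrict output,
                  size_t nblocks, size_t block_len)
{
    for (size_t b = 0; b < nblocks; b++) {
        /* Per-block base pointers: the inner loop indexes from 0. */
        const float *restrict in = input + b * block_len;
        float *restrict out = output + b * block_len;

        for (size_t i = 0; i < block_len; i++)
            out[i] = in[i] * 0.1f;
    }
}
```

For the sizes used in this thread, the call would be scale_blocks(input, output, 101, 100).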