GCC Vector Extensions Sqrt


I am currently experimenting with the GCC vector extensions. However, I am wondering how to get sqrt(vec) to work as expected.

For example:

typedef double v4d __attribute__ ((vector_size (16)));
v4d myfunc(v4d in)
{
    return some_sqrt(in);
}

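The goal is for this to emit, at least on a recent x86 system, the relevant sqrtpd instruction. Is there a GCC builtin for sqrt that works on vector types, or does one need to drop down to the intrinsic level to accomplish this?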

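This looks like a bug: apart from doing it component-wise, I don't know of any workaround. In any case, the vector extensions were never intended to replace platform-specific intrinsics.

Some funky code to that effect: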

#include <cmath>
#include <cstddef>
#include <initializer_list>
#include <utility>

template <::std::size_t...> struct indices { };

template <::std::size_t M, ::std::size_t... Is>
struct make_indices : make_indices<M - 1, M - 1, Is...> {};

template <::std::size_t... Is>
struct make_indices<0, Is...> : indices<Is...> {};

typedef float vec_type __attribute__ ((vector_size(4 * sizeof(float))));

template <::std::size_t ...Is>
vec_type sqrt_(vec_type const& v, indices<Is...> const)
{
  vec_type r;

  ::std::initializer_list<int>{(r[Is] = ::std::sqrt(v[Is]), 0)...};

  return r;
}

vec_type sqrt(vec_type const& v)
{
  return sqrt_(v, make_indices<4>());
}

int main()
{
  vec_type v = {4.0f, 9.0f, 16.0f, 25.0f};  // initialized so sqrt reads defined values

  return sqrt(v)[0];
}


You could also try auto-vectorization, which is separate from the vector extensions.

You can loop over the vector directly:

#include <math.h>
typedef double v2d __attribute__ ((vector_size (16)));   
v2d myfunc(v2d in) {
    v2d out;
    for(int i=0; i<2; i++) out[i] = sqrt(in[i]);
    return out;
}

My reading of the question is that you want the square root of 4 packed double precision values, i.e. 32 bytes. Using the appropriate AVX intrinsic:
#include <x86intrin.h>

typedef double v4d __attribute__ ((vector_size (32)));
v4d myfunc (v4d v) {
    return _mm256_sqrt_pd(v);
}

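ymm0 is the return value register.

That said, there happens to be a builtin: __builtin_ia32_sqrtpd256, which doesn't require the intrinsics header. However, I would definitely discourage its use.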
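The 16-byte typedef in the question holds two doubles, so the SSE2 counterpart of the AVX intrinsic should apply there. The following is a minimal sketch, not a definitive implementation: v2d_sqrt is a hypothetical helper name, and the element-wise fallback for targets without SSE2 is an assumption added for portability.

```c
#include <math.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: _mm_sqrt_pd */
#endif

/* Two packed doubles, 16 bytes: the width of one SSE register. */
typedef double v2d __attribute__ ((vector_size (16)));

/* Hypothetical helper name; this is not a GCC builtin. */
static inline v2d v2d_sqrt(v2d in)
{
#if defined(__SSE2__)
    /* Casts between same-size vector types reinterpret the bits;
       no value conversion takes place. */
    return (v2d)_mm_sqrt_pd((__m128d)in);
#else
    /* Assumed portable fallback: element-wise square root. */
    v2d out;
    for (int i = 0; i < 2; i++)
        out[i] = sqrt(in[i]);
    return out;
#endif
}
```

For example, v2d_sqrt((v2d){4.0, 9.0}) yields a vector holding 2.0 and 3.0.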
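Just do it the way you would for an array.

That is a bit suboptimal, given that there is an ISA instruction designed specifically for taking the square root of vectors; it could be twice as fast as two scalar square roots.

This looks like a bug: apart from doing it component-wise, I don't know of any workaround. In any case, the vector extensions were never intended to replace platform-specific intrinsics.

Post that as an answer and I will be more than happy to accept it as the solution.

-fno-math-errno is a drastic solution for GCC, but not for Clang.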
Compiler output for the AVX version of myfunc:

myfunc:
  vsqrtpd %ymm0, %ymm0 # (or just `ymm0` for Intel syntax)
  ret