C++: The correct way to store __m128 in arrays/vectors for iteration, without having to load/store from/to scalar floats

Tags: c++, performance, containers, simd, intrinsics

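Question summary:

In C++, I am trying to figure out the best way to store __m128 variables in my code, for example in an array or a std::vector, so that I do not have to call the expensive load and store intrinsics.

Long version of the question:

A commonly used way to iterate over containers of floats with SIMD SSE2 intrinsics in C++ looks like this: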

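std::vector<float> X1(N);
std::vector<float> X2(N);
std::vector<float> Y(N);

// any code to fill X1 and X2 with float scalar values

// Note: _mm_load_ps and _mm_store_ps require 16-byte-aligned addresses.
for (int i = 0; i < N; i += 4)
{
    __m128 _x1 = _mm_load_ps(&X1[i]);    // load 4 floats from X1
    __m128 _x2 = _mm_load_ps(&X2[i]);    // load 4 floats from X2
    __m128 _sum = _mm_add_ps(_x1, _x2);  // element-wise sum
    __m128 _sqrt = _mm_sqrt_ps(_sum);    // element-wise square root
    _mm_store_ps(&Y[i], _sqrt);          // store the 4 results into Y
}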
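
I tried implementing this (storing __m128 values directly in the containers), but to my surprise it ended up with timings roughly the same as the previous SIMD code, or often even worse. So I am trying to understand two things:

1) Why is this so? I was expecting a relevant performance improvement, since no loads/stores are made anywhere in the loop (which is the part of the code that matters for my performance). Am I doing something wrong in my example implementation (see the code below)?

2) What is the correct way to use containers of __m128 so as to avoid the loads and stores from scalar to vector and back to scalar, in cases where a large number of values has to be iterated over and the vectorized data can be used directly?

This is my real code implementation, already including the performance measurement: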

#include <random>
#include <chrono>
#include <iostream>
#include <xmmintrin.h>
#include <vector>
#include <cstdlib>  // system
#include <ctime>    // time

using namespace std;

const int N = 100000;
__declspec(align(16)) vector<float> X1;
__declspec(align(16)) vector<float> X2;
__declspec(align(16)) vector<float> Y(N);

__declspec(align(16)) vector<__m128> X1v;
__declspec(align(16)) vector<__m128> X2v;
__declspec(align(16)) vector<__m128> Yv(N / 4);

int main() {

    default_random_engine randomGenerator(time(0));
    uniform_real_distribution<float> diceroll(0.0f, 100.0f);

    for (int i = 0; i < N; i++) X1.push_back(diceroll(randomGenerator));
    for (int i = 0; i < N; i++) X2.push_back(diceroll(randomGenerator));
    for (int i = 0; i < N; i += 4)
    {
        __m128 _x1v = _mm_loadu_ps(&X1[i]);
        __m128 _x2v = _mm_loadu_ps(&X2[i]);
        X1v.push_back(_x1v);
        X2v.push_back(_x2v);
    }

    chrono::high_resolution_clock::time_point c1start, c1end, c2start, c2end;

    c1start = chrono::high_resolution_clock::now();
    for (int i = 0; i<N; i += 4)
    {
        __m128 _x1 = _mm_load_ps(&X1[i]);
        __m128 _x2 = _mm_load_ps(&X2[i]);
        __m128 _sum = _mm_add_ps(_x1, _x2);
        __m128 _sqrt = _mm_sqrt_ps(_sum);
        _mm_store_ps(&Y[i], _sqrt);
    }
    c1end = chrono::high_resolution_clock::now();

    c2start = chrono::high_resolution_clock::now();
    for (int j = 0; j<(N / 4); j++)
    {
        __m128 _x1 = X1v[j];
        __m128 _x2 = X2v[j];
        __m128 _sum = _mm_add_ps(_x1, _x2);
        __m128 _sqrt = _mm_sqrt_ps(_sum);
        Yv[j] = _sqrt;
    }
    c2end = chrono::high_resolution_clock::now();


    auto c1 = chrono::duration_cast<chrono::microseconds>(c1end - c1start).count();
    auto c2 = chrono::duration_cast<chrono::microseconds>(c2end - c2start).count();

    cout << "With arrays of floats and loads/stores: " << (float)c1 << endl;
    cout << "With arrays of __m128 and no loads or stores: " << (float)c2 << endl;
    cout << "Ratio arrays-of-m128: / arrays-of-floats " << (float)c2 / (float)c1 << endl;

    cout << endl;
    system("pause");
    return 0;
}
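
For comparison, the two timed loops above can be reduced to the pair of functions below. This is a minimal sketch, not code from the original post: the function names `add_sqrt_intrinsics` and `add_sqrt_indexed` are hypothetical, and both assume 16-byte-aligned input and output. Reading or writing a `__m128` element through plain indexing still has to move data between memory and a register, so compilers typically emit the same kind of load and store instructions for both variants.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>

// Hypothetical helper names; both compute out = sqrt(a + b)
// over n groups of 4 floats.

// Variant 1: float storage with explicit aligned load/store intrinsics.
void add_sqrt_intrinsics(const float* a, const float* b, float* out, std::size_t n) {
    // _mm_load_ps / _mm_store_ps require 16-byte-aligned addresses.
    assert(reinterpret_cast<std::uintptr_t>(a) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(b) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(out) % 16 == 0);
    for (std::size_t j = 0; j < n; ++j) {
        __m128 va = _mm_load_ps(a + 4 * j);
        __m128 vb = _mm_load_ps(b + 4 * j);
        _mm_store_ps(out + 4 * j, _mm_sqrt_ps(_mm_add_ps(va, vb)));
    }
}

// Variant 2: __m128 storage with plain indexing and assignment.
// The reads of a[j], b[j] and the write to out[j] are still memory accesses.
void add_sqrt_indexed(const __m128* a, const __m128* b, __m128* out, std::size_t n) {
    for (std::size_t j = 0; j < n; ++j) {
        out[j] = _mm_sqrt_ps(_mm_add_ps(a[j], b[j]));
    }
}
```

Comparing the optimized assembly of the two functions is one way to check how a given compiler treats them.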
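
Comments:

If you look at the compiled code, you will find that _mm_load_ps and _mm_store_ps compile to load and store instructions. If you are reading from or writing to a std::vector, you need loads/stores anyway. So in that sense, those intrinsics are not expensive. On the other hand, you have a bigger problem: aligning the vector object does not align its elements. You cannot align the elements unless you specify your own allocator.

@MyStuple Thanks for the comment. About alignment, sure: I was in the process of implementing a custom allocator when I noticed the issue I am asking about. About stores/loads: if I understand correctly, you are saying that putting elements into or retrieving them from any container in C++ boils down to the same load/store assembly instructions that are used for __m128? Or is that only the case for std::vector, in which case there might be a way to achieve what I described?

After compiler optimizations and everything, the (aligned) load/store intrinsics are no different from assignment. If a load or store is needed, the compiler emits the necessary load/store instructions. If it is just a register copy, the compiler does that instead. I am basically saying that the normal load/store intrinsics are useless except as a form of self-documenting code, making it explicit to the reader that you are trying to load/store.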
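
The comments above note that aligning the std::vector object does not align its elements, and that a custom allocator is needed for that. The following is a minimal sketch of such an allocator, not the asker's actual implementation. The names `AlignedAllocator`, `AlignedFloats` and `add_sqrt` are hypothetical, and the sketch assumes an x86 toolchain where `_mm_malloc` and `_mm_free` are available through `<xmmintrin.h>`.

```cpp
#include <cassert>
#include <cstddef>
#include <new>
#include <vector>
#include <xmmintrin.h>

// Hypothetical minimal allocator: every allocation is aligned to Alignment bytes.
template <typename T, std::size_t Alignment = 16>
struct AlignedAllocator {
    using value_type = T;

    AlignedAllocator() = default;
    template <typename U>
    AlignedAllocator(const AlignedAllocator<U, Alignment>&) noexcept {}

    // Explicit rebind is needed because of the non-type template parameter.
    template <typename U>
    struct rebind { using other = AlignedAllocator<U, Alignment>; };

    T* allocate(std::size_t n) {
        void* p = _mm_malloc(n * sizeof(T), Alignment);
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t) noexcept { _mm_free(p); }
};

template <typename T, typename U, std::size_t A>
bool operator==(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return true; }
template <typename T, typename U, std::size_t A>
bool operator!=(const AlignedAllocator<T, A>&, const AlignedAllocator<U, A>&) { return false; }

using AlignedFloats = std::vector<float, AlignedAllocator<float, 16>>;

// Computes y[i] = sqrt(x1[i] + x2[i]) using aligned loads and stores.
// Assumes all three vectors have the same size, a multiple of 4.
void add_sqrt(const AlignedFloats& x1, const AlignedFloats& x2, AlignedFloats& y) {
    assert(x1.size() == y.size() && x2.size() == y.size() && y.size() % 4 == 0);
    for (std::size_t i = 0; i < y.size(); i += 4) {
        __m128 a = _mm_load_ps(&x1[i]);
        __m128 b = _mm_load_ps(&x2[i]);
        _mm_store_ps(&y[i], _mm_sqrt_ps(_mm_add_ps(a, b)));
    }
}
```

With element storage aligned this way, the aligned `_mm_load_ps`/`_mm_store_ps` calls are valid on the vector data, whereas `__declspec(align(16))` on the vector variable itself only aligns the container object.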