C++ performance when iterating (cache misses)

c++ caching vector

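I have found that iterating through a vector is much faster when, instead of counting with an index variable (i), I use std::vector::iterator.

Because of some comments, here is some additional information: (1) I am using the Visual Studio C++ compiler; (2) I compiled in release mode with optimization -O2 :)

If I iterate by incrementing the variable i, the iteration takes 5875ms: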

std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");

or 5723ms:
std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec2[i]->x = 0;
    vec2[i]->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

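If I iterate using std::vector::iterator, the iteration takes 29ms:
std::vector<Data> vec(MAX_DATA);
stopWatch.start();
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data are stored in memory next to each other");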

or 110ms:
std::vector<Data*> vec2;
for (unsigned i = 0U; i < MAX_DATA; ++i)
    vec2.push_back(new Data());

stopWatch.start();
for (auto& it : vec2) {
    it->x = 0;
    it->y = 0;
}
stopWatch.stop();
stopWatch.printSpanAsMs("The data is in memory at a random position");

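Why is the other iteration so much faster?
I am surprised that iterating with the variable i over data at scattered positions in memory is just as fast as iterating with the variable i over data stored contiguously in memory.
Having the data next to each other in memory should reduce cache misses, and that seems to work when iterating with std::vector::iterator, so why not with the index variable?
Or is the gap between 29 and 110 ms perhaps not caused by cache misses at all?
The whole program looks like this: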
#include <iostream>
#include <chrono>
#include <vector>
#include <string>
#include <cstdlib>  // for system()

class StopWatch
{
public:
    void start() {
        this->t1 = std::chrono::high_resolution_clock::now();
    }

    void stop() {
        this->t2 = std::chrono::high_resolution_clock::now();
        this->diff = t2 - t1;
    }

    void printSpanAsMs(std::string startText = "time span") {
        // count() returns a 64-bit tick count; long is only 32 bits wide on MSVC
        const auto diffAsMs =
            std::chrono::duration_cast<std::chrono::milliseconds>(diff).count();
        std::cout << startText << ": " << diffAsMs << "ms" << std::endl;
    }
private:
    std::chrono::high_resolution_clock::time_point t1, t2;
    std::chrono::high_resolution_clock::duration   diff;
} stopWatch;

struct Data {
    int x, y;
};

const unsigned long MAX_DATA = 20000000;

void test1()
{
    std::cout << "1. Test \n Use i to iterate through the vector" << 
    std::endl;

    std::vector<Data> vec(MAX_DATA);
    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        vec2[i]->x = 0;
        vec2[i]->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (unsigned i = 0U; i < MAX_DATA; ++i) {
        delete vec2[i];
        vec2[i] = nullptr;
    }
}

void test2()
{
    std::cout << "2. Test \n Use std::vector<T>::iterator to iterate through the vector"
              << std::endl;

    std::vector<Data> vec(MAX_DATA);

    stopWatch.start();
    for (auto& it : vec) {
        it.x = 0;
        it.y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data are stored in memory next to each other");

    //////////////////////////////////////////////////

    std::vector<Data*> vec2;
    for (unsigned i = 0U; i < MAX_DATA; ++i)
        vec2.push_back(new Data());

    stopWatch.start();
    for (auto& it : vec2) {
        it->x = 0;
        it->y = 0;
    }
    stopWatch.stop();
    stopWatch.printSpanAsMs("The data is in memory at a random position");

    for (auto& it : vec2) {
        delete it;
        it = nullptr;
    }
}

int main()
{
    test1();
    test2();

    system("PAUSE");
    return 0;
}
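
A side note on the "random position" case in the program above: objects allocated by consecutive calls to new often end up at nearby, increasing addresses. That depends on the allocator and is an assumption here, not something measured in this thread, but if it holds, walking vec2 in allocation order is not really a random access pattern. A minimal sketch of one way to make the traversal order differ from the allocation order follows; the helper names makeShuffled and setAll are hypothetical, and std::unique_ptr stands in for the raw new/delete of the original.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <memory>
#include <random>
#include <vector>

struct Data {
    int x, y;
};

// Allocates `count` objects one by one, then shuffles the pointers so that
// the traversal order no longer matches the allocation order.
// (Hypothetical helper, not part of the original program.)
std::vector<std::unique_ptr<Data>> makeShuffled(std::size_t count, unsigned seed) {
    std::vector<std::unique_ptr<Data>> items;
    items.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
        items.push_back(std::make_unique<Data>());

    std::mt19937 rng(seed);  // fixed seed keeps runs comparable
    std::shuffle(items.begin(), items.end(), rng);
    return items;
}

// The loop one would place between stopWatch.start() and stopWatch.stop().
void setAll(std::vector<std::unique_ptr<Data>>& items, int value) {
    for (auto& p : items) {
        assert(p != nullptr);  // every slot owns an object
        p->x = value;
        p->y = value;
    }
}
```

Timing setAll on the result of makeShuffled(MAX_DATA, 42u) and on an unshuffled vector would give a fairer picture of how much scattered memory access costs on a given machine.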

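Why is the other iteration so much faster?
The reason is that MSVC 2017 cannot optimize it properly.
In the first case, it completely fails to optimize the loop: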
for (unsigned i = 0U; i < MAX_DATA; ++i) {
    vec[i].x = 0;
    vec[i].y = 0;
}

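Replacing unsigned i with size_t i, or hoisting the indexed access into a reference, does not help.
The only thing that helps is using an iterator, as you have already found out: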
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}
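
For readers who want the iterator form spelled out: the range-based for above is roughly shorthand for an explicit iterator loop, and std::fill expresses the same operation as a standard algorithm. The sketch below shows both. The function names are illustrative only, and this answer did not measure whether either variant matches the range-based loop on any particular compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Data {
    int x, y;
};

// Explicit iterator loop: roughly what the range-based for expands to.
void zeroWithIterators(std::vector<Data>& vec) {
    for (auto it = vec.begin(), end = vec.end(); it != end; ++it) {
        it->x = 0;
        it->y = 0;
    }
}

// The same operation expressed as a standard algorithm.
void zeroWithFill(std::vector<Data>& vec) {
    std::fill(vec.begin(), vec.end(), Data{0, 0});
}

// Debug-only sanity check; the assert is compiled out when NDEBUG is defined.
void checkAllZero(const std::vector<Data>& vec) {
    assert(std::all_of(vec.begin(), vec.end(),
                       [](const Data& d) { return d.x == 0 && d.y == 0; }));
    (void)vec;  // avoids an unused-parameter warning in release builds
}
```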

The code MSVC generates for both loops is listed at the bottom of this page.
clang just calls memset in both cases.

The moral of the story: if you care about performance, look at the generated code. Report issues to the vendor.
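
One way to follow that advice is to put the loops into small free functions in a file of their own and ask the compiler for an assembly listing. The sketch below assumes a file named loops.cpp (a made-up name) and hypothetical function names. The command lines in the comments use the standard listing switches of each compiler; the exact output differs between compiler versions.

```cpp
// loops.cpp (hypothetical file name)
//
// MSVC:        cl /O2 /c /FA loops.cpp     -> writes the listing loops.asm
// GCC / clang: g++ -O2 -S loops.cpp        -> writes loops.s
//              clang++ -O2 -S loops.cpp
#include <cstddef>
#include <vector>

struct Data {
    int x, y;
};

// Index-based loop, as in test1 of the question.
void zeroIndexed(std::vector<Data>& vec) {
    for (std::size_t i = 0; i < vec.size(); ++i) {
        vec[i].x = 0;
        vec[i].y = 0;
    }
}

// Range-based loop, as in test2 of the question.
void zeroRange(std::vector<Data>& vec) {
    for (auto& d : vec) {
        d.x = 0;
        d.y = 0;
    }
}
```

Comparing the bodies of the two functions in the listing shows directly whether the compiler treats them the same way.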
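Comments:

What compiler are you using? What flags are you using?

@SOUser: Visual Studio is not a compiler. You did not mention whether you enabled optimizations. Without optimizations, your benchmark is meaningless.

Should you be using unsigned?

I only use numbers greater than 0, so I used unsigned. What is wrong with that?

In this case, it is the only type that prevents the compiler from optimizing the loop into a call to memset, because it has to wrap around "correctly" if the number gets too large. See the generated assembly in the link above.

In general, I recommend that you use int, or int64_t if int might be too small, for everything except bit twiddling. For that, use whatever fixed-size unsigned type you need. I know this is controversial, but I am not alone: 9:50, 42:40, 1:02:50

Correct, but overflowing a signed type such as int invokes UB, and the optimizer can and does assume that this does not happen. Of course, that means that if the index overflows you have a bug regardless of signedness, so make sure to choose an appropriate type. As I said: int64_t if in doubt.

Thanks for your answer. I cannot read assembly, but the code looks much smaller.

"If you replace unsigned i with size_t i, the code looks better." Yes, it looks better, but it does not change anything in terms of performance. The bottom line is that MSVC generates inefficient code when iterators are not used. You cannot "fix" it, only Microsoft can (maybe in the next release).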
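
The index-type advice from the comments can be sketched as follows. This is only an illustration of the suggestion to use a signed 64-bit index. The function name is hypothetical, and the answer above states that changing the index type did not improve the code MSVC 2017 generated, so no speed-up is claimed here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

struct Data {
    int x, y;
};

// Index loop with a signed 64-bit counter, as suggested in the comments.
void zeroWithSignedIndex(std::vector<Data>& vec) {
    // The element count must fit into the signed index type.
    assert(vec.size() <=
           static_cast<std::size_t>(std::numeric_limits<std::int64_t>::max()));

    const auto count = static_cast<std::int64_t>(vec.size());
    for (std::int64_t i = 0; i < count; ++i) {
        // operator[] takes an unsigned size_type, hence the explicit cast.
        vec[static_cast<std::size_t>(i)].x = 0;
        vec[static_cast<std::size_t>(i)].y = 0;
    }
}
```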
Code generated by MSVC for the indexed loop (test1):
        xor      r9d, r9d
        mov      eax, r9d
$LL4@test1:
        mov      rdx, QWORD PTR [rcx]
        lea      rax, QWORD PTR [rax+16]
        mov      DWORD PTR [rax+rdx-16], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-12], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-8], r9d
        mov      rdx, QWORD PTR [rcx]
        mov      DWORD PTR [rax+rdx-4], r9d
        sub      r8, 1
        jne      SHORT $LL4@test1

The range-based loop and the code MSVC generates for it (test2):
for (auto& it : vec) {
    it.x = 0;
    it.y = 0;
}

        xor      ecx, ecx
        npad     2
$LL4@test2:
        mov      QWORD PTR [rax], rcx
        add      rax, 8
        cmp      rax, rdx
        jne      SHORT $LL4@test2