C++: Why is std::fill(0) slower than std::fill(1)?


c++ performance x86 compiler-optimization memset


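I have observed on one system that std::fill on a large std::vector is significantly and consistently slower when setting the constant value 0 than when setting the constant value 1 or a dynamic value:

5.8 GiB/s vs 7.5 GiB/s

However, the results are different for smaller data sizes, where fill(0) is faster.

With multiple threads, at a data size of 4 GiB, fill(1) shows a steeper slope but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s).

This raises a second question: why is the peak bandwidth of fill(1) so much lower?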

The test system is a dual-socket Intel Xeon CPU E5-2680 v3, set to 2.5 GHz (via /sys/cpufreq), with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and the Intel compiler 17.0.1 (-fast); both give identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. STREAM add with 24 threads reaches 85 GiB/s on this system.

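I was able to reproduce this effect on a different Haswell dual-socket server system, but not on any other architecture. For example, on Sandy Bridge EP the memory performance is identical, while in cache fill(0) is much faster.

Here is the code to reproduce: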

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

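// Each thread fills its own private vector. Summed over all threads, each
// timed phase writes roughly write_size bytes, so the printed values are
// aggregate bandwidths in bytes per second.
void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
        // Split data_size bytes evenly across the threads.
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        // size_t counter avoids a signed/unsigned comparison with repeat.
        for (size_t r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (size_t r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
        // Only the master thread reports, using its own timestamps.
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}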

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}
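
The program above prints bandwidth as bytes per second, while the figures quoted in the question are in GiB/s. A minimal sketch of the conversion follows; the helper name to_gib_per_s is hypothetical and not part of the original code.

```cpp
#include <cassert>

// Number of bytes in one GiB (2^30).
constexpr double bytes_per_gib = 1024.0 * 1024.0 * 1024.0;

// Hypothetical helper: converts a bandwidth in bytes/s, as printed by the
// benchmark, into GiB/s, the unit used in the text.
constexpr double to_gib_per_s(double bytes_per_second) {
    return bytes_per_second / bytes_per_gib;
}
```

Applying it to the two CSV columns gives numbers directly comparable to the 5.8 GiB/s and 7.5 GiB/s quoted above.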

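I will share my preliminary findings in the hope of encouraging more detailed answers. I just felt this would be too much as part of the question itself.

The compiler optimizes fill(0) into an internal memset. It cannot do the same for fill(1), because memset only works on bytes.

Specifically, both glibc's __memset_avx2 and __intel_avx_rep_memset are implemented with a single hot instruction: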
rep    stos %al,%es:(%rdi)

In contrast, the manual loop compiles down to an actual 128-bit store instruction:

add    $0x1,%rax
add    $0x10,%rdx
movaps %xmm0,-0x10(%rdx)
cmp    %rax,%r8
ja     400f41
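
To illustrate why memset can stand in for fill(0) but not for fill(1) on int elements, here is a small sketch. The helper names are hypothetical; the point is only that memset repeats one byte value, so every 32-bit element ends up made of four copies of that byte.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: fills the buffer the way memset does, by repeating
// a single byte value over the whole storage.
inline void fill_bytes(std::vector<std::int32_t>& v, unsigned char byte) {
    if (!v.empty())
        std::memset(v.data(), byte, v.size() * sizeof(std::int32_t));
}

// Returns the value each 32-bit element holds after a byte-wise fill.
inline std::int32_t element_after_byte_fill(unsigned char byte) {
    std::vector<std::int32_t> v(4, -1);
    fill_bytes(v, byte);
    return v[0];
}
```

A byte fill with 0 yields the int value 0, which matches fill(0). A byte fill with 1 yields 0x01010101 rather than 1, so the compiler has to keep a real store loop for fill(1).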

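Interestingly, while there is a template/header optimization that implements std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop. Strangely, for a std::vector of a byte type, gcc starts to optimize fill(1) as well. The Intel compiler does not, despite the memset template specialization.

Since this happens only when the code is actually working in memory rather than in cache, it appears that the Haswell-EP architecture fails to consolidate the single-byte writes efficiently.

I would appreciate any further insight into the issue and the related microarchitecture details. In particular, it is unclear to me why this behaves so differently for four or more threads, and why memset is so much faster in cache.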
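Update:

Here is a result in comparison with the following variants.

fill(1) compiled with -march=native (AVX2 vmovdq %ymm0): it works better in L1, but performs similarly to the movaps %xmm0 version at the other memory levels.

Variants using 32-, 128- and 256-bit non-temporal stores: they perform consistently, with the same performance regardless of the data size. All of them outperform the other variants in memory, especially for small numbers of threads. The 128-bit and 256-bit variants perform the same; for small numbers of threads, the 32-bit variant performs significantly worse.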
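
The answer does not show the non-temporal variants themselves. As a rough sketch of what a 128-bit non-temporal fill could look like, the following uses the SSE2 intrinsics _mm_stream_si128 and _mm_sfence. The function name fill_nt128 and the head/tail handling are assumptions made for illustration, not the code that was actually benchmarked.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#include <vector>

// Hypothetical helper: fills n ints starting at p with value, using 128-bit
// non-temporal (streaming) stores for the 16-byte-aligned middle part.
inline void fill_nt128(int* p, std::size_t n, int value) {
    std::size_t i = 0;
    // Scalar head: _mm_stream_si128 requires a 16-byte-aligned address.
    while (i < n && reinterpret_cast<std::uintptr_t>(p + i) % 16 != 0)
        p[i++] = value;
    const __m128i x = _mm_set1_epi32(value);
    // Main loop: one streaming store writes four ints and bypasses the cache.
    for (; i + 4 <= n; i += 4)
        _mm_stream_si128(reinterpret_cast<__m128i*>(p + i), x);
    // Scalar tail for the remaining 0 to 3 elements.
    for (; i < n; ++i)
        p[i] = value;
    // Order the streaming stores before any later stores.
    _mm_sfence();
}
```

Because streaming stores bypass the cache hierarchy, a variant like this would be expected to behave the same regardless of data size, which is consistent with the observation above.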

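From your question plus the compiler-generated asm in your answer:

fill(0) is a rep stos (ERMSB) that uses 256b stores in an optimized microcoded loop. (It works best if the buffer is aligned, probably to at least 32B or maybe 64B.)

fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to about 32 kiB. Compile with -march=haswell or -march=native to fix that.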
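
As a rough worked example of the one-store-per-clock argument, the sketch below computes the upper bound on per-core L1D store bandwidth for 128-bit and 256-bit stores. It assumes the fixed 2.5 GHz clock from the question and ignores every other limit; the function name is hypothetical.

```cpp
#include <cassert>

// Upper bound on per-core store bandwidth in bytes/s, assuming exactly one
// store executes per core clock cycle, whatever its width.
constexpr double peak_store_bytes_per_s(double core_hz, int bytes_per_store) {
    return core_hz * bytes_per_store;
}

constexpr double clock_hz = 2.5e9;  // fixed core frequency stated in the question
constexpr double bw_128 = peak_store_bytes_per_s(clock_hz, 16);  // 128-bit movaps
constexpr double bw_256 = peak_store_bytes_per_s(clock_hz, 32);  // 256-bit store

// Doubling the store width doubles the bound, matching the roughly 2x gap in L1.
static_assert(bw_256 == 2 * bw_128, "256-bit stores double the bound");
```

Under these assumptions the bound is 40e9 bytes/s for 128-bit stores and 80e9 bytes/s for 256-bit stores per core.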
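Haswell can only just keep up with the loop overhead, but it can still run one store per clock even though the loop is not unrolled at all. But with 4 fused-domain uops per clock, that is a lot of filler taking up space in the out-of-order window. Some unrolling might let TLB misses start resolving further ahead of where the stores are happening, since store-address uops have more throughput than store-data uops. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)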
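
The unrolling idea can be sketched with SSE2 intrinsics as follows. This is only an illustration under assumptions: the function name fill_unrolled128 and the unroll factor of four are hypothetical, and an optimizing compiler may reshape the loop, so the generated asm should be checked before drawing conclusions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (x86 only)
#include <vector>

// Hypothetical helper: fill with 128-bit aligned stores, unrolled four times
// so that loop overhead is paid once per 64 bytes instead of once per 16.
inline void fill_unrolled128(int* p, std::size_t n, int value) {
    std::size_t i = 0;
    // Scalar head until the address is 16-byte aligned.
    while (i < n && reinterpret_cast<std::uintptr_t>(p + i) % 16 != 0)
        p[i++] = value;
    const __m128i x = _mm_set1_epi32(value);
    // Four aligned 128-bit stores (16 ints) per iteration.
    for (; i + 16 <= n; i += 16) {
        _mm_store_si128(reinterpret_cast<__m128i*>(p + i), x);
        _mm_store_si128(reinterpret_cast<__m128i*>(p + i + 4), x);
        _mm_store_si128(reinterpret_cast<__m128i*>(p + i + 8), x);
        _mm_store_si128(reinterpret_cast<__m128i*>(p + i + 12), x);
    }
    // Scalar tail for the remaining elements.
    for (; i < n; ++i)
        p[i] = value;
}
```

Whether this closes the gap to rep stos for buffers that fit in L1D would have to be measured; the answer only suggests that it might.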


Note that rep stosd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell.
Although only t