C++: Strange performance drop when iterating over incomplete rows of a 2D array


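While optimizing some code for a Cortex-A53 (aarch64), I ran into strange cache behavior.

In a simplified case (see below), I read numbers from a 2D array row by row. Strangely, if I do not traverse the whole row but skip its last part, the processing time increases. Moreover, the effect is not observed when going from the last row to the first.

The code below reproduces this behavior using the Google Benchmark framework.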

#include <cstdint>   // int32_t
#include <iostream>
#include <benchmark/benchmark.h>

constexpr int CACHE_LINE_SIZE = 64;

using ElementType = int32_t;
constexpr int BUFFER_H = 1024;
constexpr int BUFFER_W = 4096; // in bytes

template <int X_END_BYTES> void BM_go_forward(benchmark::State& state)
{
    static __attribute__((aligned(CACHE_LINE_SIZE)))
    ElementType data[BUFFER_H][BUFFER_W / sizeof(ElementType)] = {5};
    int X_END = X_END_BYTES / sizeof(ElementType);
    for (auto _: state)
    {
        auto tmp = 0;
        for (int yc = 0; yc < BUFFER_H; ++yc)
        {
            for (int x = 0; x < X_END; ++x)
            {
                tmp += data[yc][x];
                tmp /= 2;
            }
        }
        benchmark::DoNotOptimize(tmp);
    }
}

template <int X_END_BYTES> void BM_go_backward(benchmark::State& state)
{
    static __attribute__((aligned(CACHE_LINE_SIZE)))
    ElementType data[BUFFER_H][BUFFER_W / sizeof(ElementType)] = {5};
    int X_END = X_END_BYTES / sizeof(ElementType);
    for (auto _: state)
    {
        auto tmp = 0;
        int y = BUFFER_H-1;
        for (int yc = 0; yc < BUFFER_H; ++yc, --y)
        {
            for (int x = 0; x < X_END; ++x)
            {
                tmp += data[y][x];
                tmp /= 2;
            }
        }
        benchmark::DoNotOptimize(tmp);
    }
}


constexpr int ITERS = 100;

BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 1)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 2)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 4)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 6)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 8)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 10)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 12)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 14)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 16)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 20)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 24)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 28)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_forward, BUFFER_W - CACHE_LINE_SIZE * 32)->Iterations(ITERS);

BENCHMARK_TEMPLATE(BM_go_backward, BUFFER_W)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_backward, BUFFER_W - CACHE_LINE_SIZE * 1)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_backward, BUFFER_W - CACHE_LINE_SIZE * 2)->Iterations(ITERS);
BENCHMARK_TEMPLATE(BM_go_backward, BUFFER_W - CACHE_LINE_SIZE * 4)->Iterations(ITERS);

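CPU frequency scaling is disabled. During profiling the process runs as a real-time process. The code is compiled with -O3. I observed no differences in the assembly of the inner loop between the different instantiations of the benchmark function.

I suspect this behavior is related to the L1 cache prefetcher, but I am no expert on caches. Can anyone explain the cause of this performance drop? And why is it observed only when rows are accessed in forward order, not in reverse?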

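Which AArch64 microarchitecture are you using? Cortex-A76? Cortex-A53? One of Apple's? Either way, yes, I think this is because hardware prefetch still spends time prefetching data you end up not using. It doesn't "see" your stream of reads stop until it is already too late, so part of your bandwidth is wasted on data your program never touches. Or else it doesn't cross the gap and has to restart from demand misses in the new row.

I see that the total time is higher for the smaller problems, not just the time per element. But neither of those explains why reading backward avoids the problem for each row separately. Does the array on the stack perhaps end up page-aligned, or close to it? If HW prefetch stops at page boundaries (which is common, since virtual pages may not map to physically contiguous memory), the last element read in each row could be the last one in a page (in that direction), stopping the HW prefetch at the ideal spot every time. Or every other time, with 4k rows in 8k pages. Maybe try a width that is not a power of 2?

With a row size of 3072 the effect is similar.

@PeterCordes You said "Or else it doesn't cross the gap and has to restart from demand misses in the new row", but that is not the whole story. In the [...] case the prefetcher also restarts, yet there we get more prefetches than in [...] (and a much shorter processing time). For some reason, small gaps are worse than big gaps.

But... you also mentioned paging, and paging does seem to be involved. If the memory is aligned to the page size, this effect disappears completely!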

--------------------------------------------------------------------------------------------------------
Benchmark                                                              Time             CPU   Iterations
--------------------------------------------------------------------------------------------------------
BM_go_forward<BUFFER_W>/iterations:100                              5.67 ms         5.67 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 1>/iterations:100        6.53 ms         6.53 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 2>/iterations:100        6.46 ms         6.46 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 4>/iterations:100        7.22 ms         7.22 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 6>/iterations:100        7.02 ms         7.02 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 8>/iterations:100        6.77 ms         6.77 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 10>/iterations:100       6.51 ms         6.50 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 12>/iterations:100       5.94 ms         5.94 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 14>/iterations:100       4.69 ms         4.69 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 16>/iterations:100       4.50 ms         4.50 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 20>/iterations:100       4.17 ms         4.17 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 24>/iterations:100       3.84 ms         3.84 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 28>/iterations:100       3.48 ms         3.48 ms          100
BM_go_forward<BUFFER_W - CACHE_LINE_SIZE * 32>/iterations:100       3.14 ms         3.14 ms          100
BM_go_backward<BUFFER_W>/iterations:100                             6.00 ms         6.00 ms          100
BM_go_backward<BUFFER_W - CACHE_LINE_SIZE * 1>/iterations:100       5.81 ms         5.81 ms          100
BM_go_backward<BUFFER_W - CACHE_LINE_SIZE * 2>/iterations:100       5.75 ms         5.75 ms          100
BM_go_backward<BUFFER_W - CACHE_LINE_SIZE * 4>/iterations:100       5.53 ms         5.53 ms          100

CPU: ARM Cortex-A53, speed 1400 MHz (estimated)
Counted L1D_CACHE_REFILL events (Level 1 data cache refill) with a unit mask of 0x00 (No unit mask) count 10007
Counted PREFETCH_LINEFILL events (Linefill because of prefetch) with a unit mask of 0x00 (No unit mask) count 10007
samples  %        samples  %        image name               symbol name
31        1.6316  639       6.7214  all_benchmarks           void BM_go_forward<4096>(benchmark::State&)
121       6.3684  545       5.7326  all_benchmarks           void BM_go_forward<4032>(benchmark::State&)
123       6.4737  544       5.7221  all_benchmarks           void BM_go_forward<3968>(benchmark::State&)
213      11.2105  454       4.7754  all_benchmarks           void BM_go_forward<3840>(benchmark::State&)
207      10.8947  453       4.7649  all_benchmarks           void BM_go_forward<3712>(benchmark::State&)
200      10.5263  456       4.7965  all_benchmarks           void BM_go_forward<3584>(benchmark::State&)
193      10.1579  463       4.8701  all_benchmarks           void BM_go_forward<3456>(benchmark::State&)
147       7.7368  491       5.1646  all_benchmarks           void BM_go_forward<3328>(benchmark::State&)
48        2.5263  583       6.1323  all_benchmarks           void BM_go_forward<3200>(benchmark::State&)
50        2.6316  557       5.8588  all_benchmarks           void BM_go_forward<3072>(benchmark::State&)
46        2.4211  525       5.5222  all_benchmarks           void BM_go_forward<2816>(benchmark::State&)
46        2.4211  484       5.0910  all_benchmarks           void BM_go_forward<2560>(benchmark::State&)
39        2.0526  435       4.5756  all_benchmarks           void BM_go_forward<2304>(benchmark::State&)
40        2.1053  403       4.2390  all_benchmarks           void BM_go_forward<2048>(benchmark::State&)
61        3.2105  614       6.4584  all_benchmarks           void BM_go_backward<4096>(benchmark::State&)
59        3.1053  612       6.4374  all_benchmarks           void BM_go_backward<4032>(benchmark::State&)
60        3.1579  606       6.3743  all_benchmarks           void BM_go_backward<3968>(benchmark::State&)
52        2.7368  615       6.4689  all_benchmarks           void BM_go_backward<3840>(benchmark::State&)