C: Find element index using AVX2 - code optimization

c linux performance simd avx2


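I'm tinkering with AVX2 to write some code that searches for a 32-bit hash in an array of 14 entries and returns the index of the entry found.

Since the vast majority of hits will most likely fall within the first 8 entries of the array, this code could already be improved by adding __builtin_expect, but that is not my priority right now.

Although the array of hashes (the variable hashes in the code) is always 14 entries long, it is contained in a struct of this kind: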

typedef struct chain_ring chain_ring_t;
struct chain_ring {
    uint32_t hashes[14];
    chain_ring_t* next;
    /* ...other stuff... */
} __attribute__((aligned(16)));
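To make the layout concrete, here is a minimal sketch of which 8-entry windows stay inside hashes. It assumes a cut-down copy of the struct containing only the two fields shown above (the elided fields are left out), and window_in_bounds is a hypothetical helper written for illustration, not part of the original code.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Cut-down copy of the struct from the question; the elided fields are omitted */
typedef struct chain_ring chain_ring_t;
struct chain_ring {
    uint32_t hashes[14];
    chain_ring_t *next;
} __attribute__((aligned(16)));

enum { HASH_COUNT = 14, LANES_PER_YMM = 8 };

/* The array starts the struct and occupies 14 * 4 = 56 bytes */
_Static_assert(offsetof(chain_ring_t, hashes) == 0, "hashes is the first member");
_Static_assert(sizeof(((chain_ring_t *)0)->hashes) == 56, "14 entries of 4 bytes");

/* Hypothetical helper: true when a 256-bit load starting at first_entry
   reads only entries that belong to hashes */
static bool window_in_bounds(size_t first_entry) {
    return first_entry + LANES_PER_YMM <= HASH_COUNT;
}
```

window_in_bounds(0) and window_in_bounds(6) hold, while window_in_bounds(8) does not: a window starting at entry 8 also covers the 8 bytes that follow the array. Separately, the struct only guarantees 16-byte alignment, whereas the aligned 256-bit load intrinsics expect a 32-byte aligned address.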

Here is the code:

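int8_t hash32_find_14_avx2(uint32_t hash, volatile uint32_t* hashes) {
    uint32_t compacted_result_mask, leading_zeroes;
    __m256i cmp_vector, ring_vector, result_mask_vector;
    int8_t found_index = -1;

    /* Fast path: check the first entry with a scalar compare */
    if (hashes[0] == hash) {
        return 0;
    }

    /* Two 8-entry windows: entries 0..7, then 8..15 (two past the 14 real entries) */
    for(uint8_t base_index = 0; base_index < 14; base_index += 8) {
        cmp_vector = _mm256_set1_epi32(hash);
        ring_vector = _mm256_stream_load_si256((__m256i*) (hashes + base_index));

        /* Matching lanes become all-ones; movemask_epi8 then yields 4 mask bits per lane */
        result_mask_vector = _mm256_cmpeq_epi32(ring_vector, cmp_vector);
        compacted_result_mask = _mm256_movemask_epi8(result_mask_vector);

        if (compacted_result_mask != 0) {
            /* 1-based position of the highest set bit; divided by 4 it gives the last matching lane */
            leading_zeroes = 32 - __builtin_clz(compacted_result_mask);
            found_index = base_index + (leading_zeroes >> 2u) - 1;
            break;
        }
    }

    /* Matches in the two extra lanes (indexes 14 and 15) are reported as not found */
    return found_index > 13 ? -1 : found_index;
}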

Thanks for your suggestions.

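--- UPDATE

I've updated the gist with the branchless implementation by @chtz and replaced __lzcnt32 with _tzcnt_u32. I had to change the behaviour slightly so that not found is signalled by a return value of 32 instead of -1, but that doesn't really matter. The CPU the benchmarks ran on is an Intel Core i7 8700 (6c/12t, 3.20 GHz).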
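The difference between the two bit scans can be sketched in scalar code. The helpers below are hypothetical stand-ins written for illustration (they are not the __lzcnt32 / _tzcnt_u32 intrinsics themselves), and they assume a mask in which bit i is set when hashes[i] matched.

```c
#include <stdint.h>

/* Portable stand-ins for the lzcnt / tzcnt instructions: unlike
   __builtin_clz / __builtin_ctz they are defined for x == 0 and return 32 */
static uint32_t lzcnt32_model(uint32_t x) {
    return x ? (uint32_t)__builtin_clz(x) : 32u;
}

static uint32_t tzcnt32_model(uint32_t x) {
    return x ? (uint32_t)__builtin_ctz(x) : 32u;
}

/* Index of the LAST matching entry, or -1 when the mask is empty */
static int index_via_lzcnt(uint32_t mask) {
    return 31 - (int)lzcnt32_model(mask);
}

/* Index of the FIRST matching entry, or 32 when the mask is empty */
static int index_via_tzcnt(uint32_t mask) {
    return (int)tzcnt32_model(mask);
}
```

With a single match both variants agree. With two matches, for example mask 0x24 (entries 2 and 5), the lzcnt variant reports 5 and the tzcnt variant reports 2, so the choice only matters if the same hash can be stored more than once.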

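The bench uses CPU pinning, runs more threads than there are physical or logical CPU cores, and performs some additional operations, notably a for loop, so there is overhead. The overhead is the same for both tests, though, so it should affect them equally.

If you want to run the test, you need to set CPU_CORE_LOGICAL_COUNT by hand to match the number of logical CPU cores of your CPU.

It's interesting to see how the performance improvement jumps from +17% to +41% as contention grows (from a single thread to 64 threads). I also ran a few tests with 128 and 256 threads and saw up to +60% speed improvement when using AVX2, but I haven't included those numbers below.

(bench_template_hash32_find_14_avx2 benchmarks the branchless version; I shortened the name to make the post more readable.)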

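You can implement this entirely without branches by comparing two overlapping parts of your array, bit-ORing the results together and getting the last bit position with a single lzcnt. Also, using vmovmskps instead of vpmovmskb saves dividing the result by 4 (I'm not sure whether this causes any domain-crossing latency, though).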
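The overlapping-window idea can be followed step by step in a scalar model. This is only a sketch of the bit manipulation, not the vectorized code: window_mask8 and find_14_model are hypothetical names, and the inner loop stands in for the compare plus vmovmskps pair.

```c
#include <stdint.h>

/* Scalar stand-in for "compare 8 consecutive entries and collect one bit per lane" */
static uint32_t window_mask8(uint32_t hash, const uint32_t *window) {
    uint32_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        mask |= (uint32_t)(window[lane] == hash) << lane;
    }
    return mask;
}

/* Windows 0..7 and 6..13 overlap on entries 6 and 7; shifting the second
   mask by 6 lines its bits up with the array indexes, so bit i <=> hashes[i] */
static int find_14_model(uint32_t hash, const uint32_t hashes[14]) {
    uint32_t mask = window_mask8(hash, hashes)
                  | (window_mask8(hash, hashes + 6) << 6);
    /* lzcnt would return 32 for an empty mask; __builtin_clz needs the guard */
    return mask ? 31 - __builtin_clz(mask) : -1;
}
```

Entries 6 and 7 are compared twice, but both comparisons set the same bit, so the overlap is harmless and no load ever reaches past entry 13.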

_mm256_stream_load_si256? Is your data in video RAM, or did you somehow map the memory pages as WC instead of normal WB cacheable? If not, vmovntdqa loads are just a slower version of normal loads. Also, use _mm256_movemask_ps, or packssdw / packsswb your dword vectors together before movemask_epi8, so you get more data per branch.

__builtin_clz is undefined for 0, and in fact gcc happily optimizes 31-__builtin_clz(x) to bsr (which is also undefined for zero input). Since you want the leading-zero count, you probably want _lzcnt_u32 instead of the GNU C builtin. I think all AVX2 machines also have lzcnt (and the rest of BMI1), so you're not losing anything by requiring BMI1 as well. Unless you actually want 32-clz instead of 31-clz.

N.B.: If I understand your benchmark code correctly, your loop results are most likely flawed, since testing the same index every time gives you near-perfect branch prediction (this actually also applies to your branching AVX2 code). Unless, of course, you actually expect this behaviour.

You should compile with -march=native on your local machine, to set tuning options appropriately and let the compiler use all your CPU's features (like cmpxchg16b, FMA, and BMI2).
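The two mask layouts mentioned in the comments can be compared side by side. The sketch below uses hypothetical helper names; it takes the intrinsic path only when the compiler defines __AVX2__ (for example with -mavx2 or -march=native) and otherwise falls back to scalar code that produces the same bit patterns.

```c
#include <stdint.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Byte-granular mask (vpmovmskb layout): a matching lane k sets bits 4k..4k+3 */
static uint32_t byte_mask8(uint32_t hash, const uint32_t *window) {
#if defined(__AVX2__)
    __m256i eq = _mm256_cmpeq_epi32(_mm256_loadu_si256((const __m256i *)window),
                                    _mm256_set1_epi32((int)hash));
    return (uint32_t)_mm256_movemask_epi8(eq);
#else
    uint32_t mask = 0; /* scalar fallback when built without AVX2 */
    for (int lane = 0; lane < 8; lane++) {
        if (window[lane] == hash) mask |= 0xFu << (4 * lane);
    }
    return mask;
#endif
}

/* Lane-granular mask (vmovmskps layout): a matching lane k sets bit k only */
static uint32_t lane_mask8(uint32_t hash, const uint32_t *window) {
#if defined(__AVX2__)
    __m256i eq = _mm256_cmpeq_epi32(_mm256_loadu_si256((const __m256i *)window),
                                    _mm256_set1_epi32((int)hash));
    return (uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(eq));
#else
    uint32_t mask = 0; /* scalar fallback when built without AVX2 */
    for (int lane = 0; lane < 8; lane++) {
        mask |= (uint32_t)(window[lane] == hash) << lane;
    }
    return mask;
#endif
}

/* First matching lane or -1; the zero test keeps __builtin_ctz defined */
static int first_lane_from_bytes(uint32_t byte_mask) {
    return byte_mask ? __builtin_ctz(byte_mask) >> 2 : -1;
}

static int first_lane_from_lanes(uint32_t lane_mask) {
    return lane_mask ? __builtin_ctz(lane_mask) : -1;
}
```

Both paths lead to the same lane index; the lane-granular mask simply skips the shift by 2 and leaves room in a 32-bit word for more than one vector's worth of results.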

int8_t hash32_find_14_loop(uint32_t hash, volatile uint32_t* hashes) {
    /* Scalar baseline: scan the 14 entries in order and return the first match */
    for(uint8_t index = 0; index < 14; index++) {
        if (hashes[index] == hash) {
            return index;
        }
    }

    /* Not found */
    return -1;
}

----------------------------------------------------------------------------------------------------
Benchmark                                                          Time             CPU   Iterations
----------------------------------------------------------------------------------------------------
bench_template_hash32_find_14_loop/0/iterations:100000000       0.610 ns        0.610 ns    100000000
bench_template_hash32_find_14_loop/1/iterations:100000000        1.16 ns         1.16 ns    100000000
bench_template_hash32_find_14_loop/2/iterations:100000000        1.18 ns         1.18 ns    100000000
bench_template_hash32_find_14_loop/3/iterations:100000000        1.19 ns         1.19 ns    100000000
bench_template_hash32_find_14_loop/4/iterations:100000000        1.28 ns         1.28 ns    100000000
bench_template_hash32_find_14_loop/5/iterations:100000000        1.26 ns         1.26 ns    100000000
bench_template_hash32_find_14_loop/6/iterations:100000000        1.52 ns         1.52 ns    100000000
bench_template_hash32_find_14_loop/7/iterations:100000000        2.15 ns         2.15 ns    100000000
bench_template_hash32_find_14_loop/8/iterations:100000000        1.66 ns         1.66 ns    100000000
bench_template_hash32_find_14_loop/9/iterations:100000000        1.67 ns         1.67 ns    100000000
bench_template_hash32_find_14_loop/10/iterations:100000000       1.90 ns         1.90 ns    100000000
bench_template_hash32_find_14_loop/11/iterations:100000000       1.89 ns         1.89 ns    100000000
bench_template_hash32_find_14_loop/12/iterations:100000000       2.13 ns         2.13 ns    100000000
bench_template_hash32_find_14_loop/13/iterations:100000000       2.20 ns         2.20 ns    100000000
bench_template_hash32_find_14_loop/14/iterations:100000000       2.32 ns         2.32 ns    100000000
bench_template_hash32_find_14_loop/15/iterations:100000000       2.53 ns         2.53 ns    100000000
bench_template_hash32_find_14_sse/0/iterations:100000000        0.531 ns        0.531 ns    100000000
bench_template_hash32_find_14_sse/1/iterations:100000000         1.42 ns         1.42 ns    100000000
bench_template_hash32_find_14_sse/2/iterations:100000000         2.53 ns         2.53 ns    100000000
bench_template_hash32_find_14_sse/3/iterations:100000000         1.45 ns         1.45 ns    100000000
bench_template_hash32_find_14_sse/4/iterations:100000000         2.26 ns         2.26 ns    100000000
bench_template_hash32_find_14_sse/5/iterations:100000000         1.90 ns         1.90 ns    100000000
bench_template_hash32_find_14_sse/6/iterations:100000000         1.90 ns         1.90 ns    100000000
bench_template_hash32_find_14_sse/7/iterations:100000000         1.93 ns         1.93 ns    100000000
bench_template_hash32_find_14_sse/8/iterations:100000000         2.07 ns         2.07 ns    100000000
bench_template_hash32_find_14_sse/9/iterations:100000000         2.05 ns         2.05 ns    100000000
bench_template_hash32_find_14_sse/10/iterations:100000000        2.08 ns         2.08 ns    100000000
bench_template_hash32_find_14_sse/11/iterations:100000000        2.08 ns         2.08 ns    100000000
bench_template_hash32_find_14_sse/12/iterations:100000000        2.55 ns         2.55 ns    100000000
bench_template_hash32_find_14_sse/13/iterations:100000000        2.53 ns         2.53 ns    100000000
bench_template_hash32_find_14_sse/14/iterations:100000000        2.37 ns         2.37 ns    100000000
bench_template_hash32_find_14_sse/15/iterations:100000000        2.59 ns         2.59 ns    100000000
bench_template_hash32_find_14_avx2/0/iterations:100000000       0.537 ns        0.537 ns    100000000
bench_template_hash32_find_14_avx2/1/iterations:100000000        1.37 ns         1.37 ns    100000000
bench_template_hash32_find_14_avx2/2/iterations:100000000        1.38 ns         1.38 ns    100000000
bench_template_hash32_find_14_avx2/3/iterations:100000000        1.36 ns         1.36 ns    100000000
bench_template_hash32_find_14_avx2/4/iterations:100000000        1.37 ns         1.37 ns    100000000
bench_template_hash32_find_14_avx2/5/iterations:100000000        1.38 ns         1.38 ns    100000000
bench_template_hash32_find_14_avx2/6/iterations:100000000        1.40 ns         1.40 ns    100000000
bench_template_hash32_find_14_avx2/7/iterations:100000000        1.39 ns         1.39 ns    100000000
bench_template_hash32_find_14_avx2/8/iterations:100000000        1.99 ns         1.99 ns    100000000
bench_template_hash32_find_14_avx2/9/iterations:100000000        2.02 ns         2.02 ns    100000000
bench_template_hash32_find_14_avx2/10/iterations:100000000       1.98 ns         1.98 ns    100000000
bench_template_hash32_find_14_avx2/11/iterations:100000000       1.98 ns         1.98 ns    100000000
bench_template_hash32_find_14_avx2/12/iterations:100000000       2.03 ns         2.03 ns    100000000
bench_template_hash32_find_14_avx2/13/iterations:100000000       1.98 ns         1.98 ns    100000000
bench_template_hash32_find_14_avx2/14/iterations:100000000       1.96 ns         1.96 ns    100000000
bench_template_hash32_find_14_avx2/15/iterations:100000000       1.97 ns         1.97 ns    100000000
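Each row above probes one fixed index, which is the situation the branch-prediction comment warns about. One way to vary the searched index per iteration is to precompute a list of targets. This is a minimal sketch under assumptions: fill_random_targets and the xorshift32 generator are hypothetical helpers that do not appear in the original benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Small xorshift32 generator; the state must never be 0 */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Fills targets[] with pseudo-random indexes in the range 0..13 */
static void fill_random_targets(uint8_t *targets, size_t count, uint32_t seed) {
    uint32_t state = seed ? seed : 1u;
    for (size_t i = 0; i < count; i++) {
        targets[i] = (uint8_t)(xorshift32(&state) % 14u);
    }
}
```

A benchmark loop could then look up the hash stored at targets[i % count] on iteration i, so that the position of the match is no longer the same on every call.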

------------------------------------------------------------------------------------------
Benchmark                                                                 CPU   Iterations
------------------------------------------------------------------------------------------
bench_template_hash32_find_14_loop/iterations:10000000/threads:1      45.2 ns     10000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:2      50.4 ns     20000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:4      52.1 ns     40000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:8      70.9 ns     80000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:16     86.8 ns    160000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:32     87.3 ns    320000000
bench_template_hash32_find_14_loop/iterations:10000000/threads:64     92.9 ns    640000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:1      38.4 ns     10000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:2      42.1 ns     20000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:4      46.5 ns     40000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:8      52.6 ns     80000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:16     60.0 ns    160000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:32     62.1 ns    320000000
bench_template_hash32_find_14_avx2/iterations:10000000/threads:64     65.8 ns    640000000
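The +17% and +41% figures quoted in the update can be reproduced from the thread-scaling table above, assuming they were computed as the ratio of loop time to AVX2 time minus one. speedup_percent is a hypothetical helper for that arithmetic.

```c
/* Relative improvement of candidate over baseline, in percent,
   for timings where lower is better */
static double speedup_percent(double baseline_ns, double candidate_ns) {
    return (baseline_ns / candidate_ns - 1.0) * 100.0;
}
```

For one thread, speedup_percent(45.2, 38.4) is about 17.7; for 64 threads, speedup_percent(92.9, 65.8) is about 41.2.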

/* Branchless version from the answer */
int8_t hash32_find_14_avx2(uint32_t hash, volatile uint32_t* hashes) {
    uint32_t compacted_result_mask = 0;
    __m256i cmp_vector = _mm256_set1_epi32(hash);
    /* Two overlapping 8-entry windows: entries 0..7 and 6..13 */
    for(uint8_t base_index = 0; base_index < 12; base_index += 6) {
        __m256i ring_vector = _mm256_loadu_si256((__m256i*) (hashes + base_index));

        __m256i result_mask_vector = _mm256_cmpeq_epi32(ring_vector, cmp_vector);
        /* One bit per 32-bit lane, shifted so that bit i corresponds to hashes[i] */
        compacted_result_mask |= _mm256_movemask_ps(_mm256_castsi256_ps(result_mask_vector)) << (base_index);
    }
    /* lzcnt returns 32 for a zero mask, so no match yields 31 - 32 = -1 */
    int32_t leading_zeros = __lzcnt32(compacted_result_mask);
    return (31 - leading_zeros);
}