C++: Extracting and shifting odd/even bits using intrinsics


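Is there a way to optimize the following code using intrinsics? It moves all the odd-indexed bits of a 16-bit integer (the bits selected by the mask 0x5555) as far to the right as possible.

I thought about using the C++ equivalent of Fortran's ISHFTC (is there a C++ equivalent?), but I feel there is a more efficient way.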

int x = some16bit;
x = x & 0x5555;
int y = 0;
for (int i = 0; i < 8; i++)
    y = y | ((x >> i) & (0x01 << i));

Sure, here's how:
int y = (int)_pext_u32( (unsigned int)some16bitInt, 0x5555 );

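Unfortunately, this instruction is part of the BMI2 set and requires a relatively recent CPU: Intel Haswell or newer, or AMD Excavator or newer. But where it's supported, it's very fast.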
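As a rough sketch of how the intrinsic might be guarded when BMI2 is not enabled at compile time, the wrapper below uses pext only if the compiler defines __BMI2__ and otherwise falls back to the loop from the question. The name pack_even_bits16 is hypothetical, and the __BMI2__ check is an assumption that holds for GCC and Clang with -mbmi2; other compilers may need a different test. This is a compile-time choice, not runtime CPU detection.

```cpp
#if defined(__BMI2__)
#include <immintrin.h>
#endif

// Hypothetical wrapper: packs bits 0,2,4,...,14 of x into bits 0..7.
inline unsigned pack_even_bits16(unsigned x) {
#if defined(__BMI2__)
    // Assumption: __BMI2__ is defined when the target supports pext.
    return _pext_u32(x, 0x5555u);
#else
    // Portable fallback: the loop from the question.
    x &= 0x5555u;
    unsigned y = 0;
    for (int i = 0; i < 8; ++i)
        y |= (x >> i) & (1u << i);   // bit 2*i of x lands at bit i of y
    return y;
#endif
}
```

For example, pack_even_bits16(0x4001) takes bits 14 and 0 and returns 0x81.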
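Fortran's ISHFTC is just a rotate. C doesn't have one directly, but you can portably and safely write a function that pattern-recognizing compilers will recognize and compile to a single rotate instruction.
I'm not sure that's a useful building block here, but it is available.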
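A minimal sketch of such a rotate, using the commonly cited masking idiom; the name rotl32 is hypothetical, and whether a given compiler actually emits a single rotate instruction for it is something to confirm in the generated assembly.

```cpp
#include <climits>
#include <cstdint>

// Rotate left. Masking both shift counts keeps them in range, so there is
// no undefined behavior even when n is 0 or n >= 32.
inline std::uint32_t rotl32(std::uint32_t x, unsigned n) {
    const unsigned mask = CHAR_BIT * sizeof(x) - 1;  // 31 for a 32-bit value
    n &= mask;
    return (x << n) | (x >> ((0u - n) & mask));
}
```

With C++20, std::rotl from the <bit> header provides the same operation.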

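On x86 with the BMI2 instruction-set extension, there is a pext bit-extract instruction that you can use with a 0x5555 control input.
See the intrinsics documentation for _pext_u32 and _u64.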

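It's very fast on Intel Haswell and later (1 uop, 3-cycle latency, 1/clock throughput), but very slow on AMD (Ryzen: 7 uops, 18-cycle latency/throughput). I think that's worse than the shift/mask sequence I've come up with in pure C (pack_even_bits16_v2 below), especially if latency matters (not just throughput).

I'm shifting the low bits left instead of the high bits right because x86 can left-shift and add with one instruction, LEA. On other ISAs, shifting bits to the right would probably save one shift at the end.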
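This compiles quite nicely for AArch64 and PowerPC64 as well as x86. Clang sees through this bit manipulation for PowerPC and uses the powerful rlwinm (Rotate Left Word Immediate AND Mask) and rlwimi (... Mask Insert) instructions :)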
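For comparison, here is a sketch of the shift-right direction mentioned for other ISAs. It is the widely known mask-and-merge compaction rather than the answer's own code, and the name pack_even_bits16_shr is hypothetical.

```cpp
// Hypothetical name; packs bits 0,2,4,...,14 of x into bits 0..7
// by repeatedly shifting the higher groups right and merging.
inline unsigned pack_even_bits16_shr(unsigned x) {
    x &= 0x5555u;                    // 0a0b0c0d 0e0f0g0h
    x = (x | (x >> 1)) & 0x3333u;    // 00ab00cd 00ef00gh
    x = (x | (x >> 2)) & 0x0F0Fu;    // 0000abcd 0000efgh
    x = (x | (x >> 4)) & 0x00FFu;    // 00000000 abcdefgh
    return x;
}
```

Each step depends on the previous one, so this form has a longer dependency chain than splitting the work into independent halves; which one is faster depends on the target.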
pext is only fast on Intel. AMD supports it, but Ryzen, for example, runs it as 7 uops with 18c latency and throughput (compared with Intel's 1 uop, 3-cycle latency, 1c throughput). It may still be faster than the best manual bit hack on AMD, but it's not "fast".
#include <immintrin.h>

unsigned extract_even_bits_bmi2(unsigned a) {
   return _pext_u32(a, 0x5555);
}

unsigned pack_even_bits16_v2(unsigned x)
{
    x &= 0x5555;        // 0a0b0c0d0e0f0g0h
    x += x<<1;          // aabbccddeeffgghh    // x86 LEA eax, [rdi + rdi*2]
    unsigned move = x &  0b0000011000000110;   // bits to move
    unsigned keep = x &  0b0110000001100000;   // bits to keep
    x = keep + (move << 2);  // 0abcd000 0efgh000

                       // 0abcd000 0efgh000    // with byte boundary shown
    unsigned tmp = x >> 7;  // high group into place, shifting out the low bits
    x &= 0xFF;    // grab the whole low byte ; possibly with a zero-latency movzx
    x = (x>>3) | tmp;
    return x;
}

# clang trunk -O3 for PowerPC64.
# Compiling the  x += x & 0x1111;  version, not the  x += x<<1 version where we get a multiply
        andi. 4, 3, 21845        # x & 0x5555
        andi. 3, 3, 4369         # x & 0x1111
        add 4, 4, 3              # 
        rlwinm 3, 4, 31, 30, 31  # isolate the low 2 bits.  PPC counts bits from MSB=0 LSB=31 for 32-bit registers
        rlwimi 3, 4, 29, 28, 29  # insert the next 2-bit bitfield
        rlwimi 3, 4, 27, 26, 27  # ...
        rlwimi 3, 4, 25, 24, 25
        blr
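Read back into C++, the assembly above corresponds roughly to the following. This is a reconstruction from the instruction comments, not necessarily the source that was compiled, and the name pack_even_bits16_fields is hypothetical.

```cpp
// Hypothetical reconstruction of what the PowerPC code computes.
inline unsigned pack_even_bits16_fields(unsigned x) {
    // Adding x & 0x1111 shifts every second kept bit left by one,
    // pairing the bits into 2-bit fields: 0ab00cd0 0ef00gh0
    unsigned t = (x & 0x5555u) + (x & 0x1111u);
    return ((t >> 1) & 0x03u)    // gh  (rlwinm)
         | ((t >> 3) & 0x0Cu)    // ef  (rlwimi)
         | ((t >> 5) & 0x30u)    // cd  (rlwimi)
         | ((t >> 7) & 0xC0u);   // ab  (rlwimi)
}
```

Each rotate-and-mask instruction extracts one 2-bit field, which is why four of them are enough for eight result bits.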

    unsigned tmp = x & mask;
    x += tmp;          // left shift those bits
    x += tmp<<1;       // left shift them again.  (x86 can do this with LEA eax, [rax + rdx*2])

    unsigned tmp = x &   0b0000011000000110;   // bits to move
    x ^= tmp;          // clear those bits
    x += tmp << 2;     // LEA eax, [eax + edx*4]  1 fast instruction on x86
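The fragments above can be wrapped into a small helper to see the trick in isolation. This is only an illustrative sketch: move_bits_left is a hypothetical name, and it assumes the destination positions are clear in x, so the addition produces no carries and behaves like an OR.

```cpp
// Hypothetical helper: moves the bits of x selected by mask left by
// `count` positions. Assumes the destination positions are clear in x.
inline unsigned move_bits_left(unsigned x, unsigned mask, unsigned count) {
    unsigned tmp = x & mask;   // bits to move
    x ^= tmp;                  // clear them at their old positions
    x += tmp << count;         // no carries, since the destinations are clear
    return x;
}
```

With x = 0x6666 (the aabbccdd eeffgghh pattern with the outer bit of each pair dropped), mask = 0x0606 and count = 2, this yields 0x7878, the same value as keep + (move << 2) in pack_even_bits16_v2.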