C++: What is the efficient way to count set bits at a position or lower?


c++ algorithm performance


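Given std::bitset<64> bits with any number of bits set and a bit position X (0-63), what is the most efficient way to count the bits at position X or lower, or to return 0 if the bit at position X is not set?

Note: if the bit is set, the return value will always be at least 1.

The brute-force way is very slow: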

int countupto(std::bitset<64> bits, int X)
{
  if (!bits[X]) return 0;
  int total=1;
  for (int i=0; i < X; ++i)
  {
    total+=bits[i];
  }
  return total;
}



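The count() method of bitset gives you the popcount of all the bits, but bitset does not support ranges.

Note: this is not a duplicate of the question about counting all set bits, because that one asks about all bits rather than just the range 0 through X.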
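My immediate reaction would be to test the specified bit and immediately return 0 if it is clear.

If you get past that, create a bitmask with that bit (and the less significant ones) set, and AND that with the original input. Then use the count() member function to get the number of bits set in the result.

As for creating the mask: you can shift 1 left N places, then subtract 1.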
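A minimal sketch of the test, mask and count steps described above. The helper name count_at_or_below is hypothetical. One detail to watch: a mask that includes bit X itself would be (1 << (X + 1)) - 1, which shifts by 64 when X is 63, and that is undefined in C++. This sketch sidesteps the problem by shifting an all-ones value right instead.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical helper: counts the set bits at positions 0..x,
// or returns 0 if bit x itself is clear.
int count_at_or_below(const std::bitset<64>& bits, int x) {
    assert(x >= 0 && x < 64);
    if (!bits[x]) return 0;  // test the specified bit first

    // Mask with bit x and every less significant bit set.
    // Shifting all-ones right by (63 - x) never shifts by 64.
    const std::uint64_t mask = ~std::uint64_t{0} >> (63 - x);

    // AND with the input, then count what is left.
    return static_cast<int>((bits & std::bitset<64>(mask)).count());
}
```

With the value 0x57 (binary 1010111), count_at_or_below returns 3 for position 2 and 0 for position 3.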
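Assuming unsigned long or unsigned long long is wide enough to hold 64 bits, you can call bits.to_ulong() or bits.to_ullong() to get the bitset data as an integer, mask off the bits above X, and count what remains.

It is easy to convert between a bit and a mask for the bits below that bit, so something like this should work: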
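#include <bitset>

int popcnt(std::bitset<64> bs, int x) {
    // Early out when bit not set
    if (!bs[x]) return 0;
    // Otherwise, make mask from `x`, mask and count bits.
    // 1ULL is at least 64 bits wide; 1UL is only 32 bits on some platforms.
    return (bs & std::bitset<64>((1ULL << x) - 1)).count() + 1;
}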

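I have adapted a problem I had seen before, which checks whether a number has an odd or even number of bits set. It is written in C, but it should not be too hard to port to C++. The crux of the solution is the while loop. Try it on paper to understand how it picks out the LSB and then removes it from x. The rest of the code is straightforward. The code runs in O(n), where n is the number of set bits in x. That is much better than linear time, which I also thought was the only option when I first looked at this problem.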
#include <stdio.h>

int
count(long x, int pos)
{
    /* if bit at location pos is not set, return 0 */
    if (!((x >> pos) & 1))
    {
        return 0;
    }

    /* prepare x by removing set bits after position pos */
    long tmp = x;
    tmp = tmp >> (pos + 1);
    tmp = tmp << (pos + 1);
    x ^= tmp;

    /* increment count every time the first set bit of x is removed (from the right) */
    long y; /* same width as x, so an isolated bit above bit 31 is not truncated */
    int count = 0;
    while (x != 0)
    {
        y = x & ~(x - 1);
        x ^= y;
        count++;
    }
    return count;
}

int
main(void)
{
    /* run tests */
    long num = 0x57; /* binary 1010111; 0b literals are not standard C before C23 */
    printf("%d\n", count(num, 0)); /* prints: 1 */
    printf("%d\n", count(num, 1)); /* prints: 2 */
    printf("%d\n", count(num, 2)); /* prints: 3 */
    printf("%d\n", count(num, 3)); /* prints: 0 */
    printf("%d\n", count(num, 4)); /* prints: 4 */
    printf("%d\n", count(num, 5)); /* prints: 0 */
    printf("%d\n", count(num, 6)); /* prints: 5 */
}
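Since the answer says the C code should port to C++ without much trouble, here is one possible port as a sketch. The function name count_upto_clearing is made up for illustration. It uses x &= x - 1, which is equivalent to isolating the lowest set bit and XORing it out, and it clears the bits above pos with a single mask so that no shift by 64 occurs when pos is 63.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical C++ port of the bit-clearing loop above.
int count_upto_clearing(const std::bitset<64>& bits, int pos) {
    assert(pos >= 0 && pos < 64);
    std::uint64_t x = bits.to_ullong();

    // If the bit at pos is not set, return 0.
    if (((x >> pos) & 1u) == 0) return 0;

    // Drop the set bits above pos.
    x &= ~std::uint64_t{0} >> (63 - pos);

    // Each iteration clears the lowest set bit, so the loop
    // runs once per set bit rather than once per bit position.
    int count = 0;
    while (x != 0) {
        x &= x - 1;
        ++count;
    }
    return count;
}
```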

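The C++ for this approach gets g++ to emit the x86-64 asm listed further below. I expect it to compile efficiently on other 64-bit architectures too, provided there is a hardware popcount for std::bitset::count to use; otherwise that will always be the slow part. For example, be sure to use g++ -march=nehalem or higher, or -mpopcnt if you don't want to enable anything else, if you can limit your code to running only on CPUs that support that x86 instruction.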
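The C++ source that this paragraph refers to is not reproduced here. As a stand-in, the following is a minimal sketch reconstructed from the comments in the assembly listings (shift left by 63 - pos, popcount, then test the top bit). It is an assumption about the shape of the code, not the answer's exact text; only the function name popcount_subset is taken from those listings.

```cpp
#include <bitset>
#include <cassert>

// Sketch of the shift-then-count approach (reconstructed, not the original source).
int popcount_subset(std::bitset<64> bits, int pos) {
    assert(pos >= 0 && pos < 64);

    // Move bits[pos] up to bit 63; every bit above pos is shifted out.
    bits <<= (63 - pos) & 63;

    // Ternary form: the count is only returned when the original bits[pos] was set.
    return bits[63] ? static_cast<int>(bits.count()) : 0;
}
```

The appeal of this shape is that no separate mask has to be built: the shift both discards the unwanted high bits and puts the bit being tested in the sign position.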
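For background, gcc is using the -x == ~x + 1 two's complement identity here. (The linked write-up also mentions, tangentially, that shl masks the shift count, so only the low 6 bits of ecx need to hold 63 - pos. It is linked mostly because I wrote it recently and anyone still reading this paragraph might find it interesting.)

Some of these instructions will go away when the function is inlined (e.g. gcc would generate the count in ecx in the first place).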
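A small sketch of why a NOT can stand in for the subtraction. From -x == ~x + 1 it follows that ~pos == -pos - 1, which is congruent to 63 - pos modulo 64, and the shift only looks at the low 6 bits of the count. The helper names below are illustrative, not from the original answer.

```cpp
#include <cassert>

// Shift count computed with a subtraction, reduced to the low 6 bits.
unsigned shift_count_sub(unsigned pos) {
    assert(pos < 64);
    return (63u - pos) & 63u;
}

// Shift count computed with a NOT; the & 63 models the hardware
// ignoring everything above the low 6 bits of the count.
unsigned shift_count_not(unsigned pos) {
    assert(pos < 64);
    return ~pos & 63u;
}

// Returns true if both forms agree for every valid bit position.
bool shift_counts_agree() {
    for (unsigned pos = 0; pos < 64; ++pos) {
        if (shift_count_sub(pos) != shift_count_not(pos)) return false;
    }
    return true;
}
```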
Using Glenn's idea of a multiply instead of the ternary operator (enabled by USE_mul), gcc does
    shr     rdi, 63
    imul    eax, edi

at the end, instead of xor / test / cmovs.
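A sketch of what the multiply variant could look like in C++, assuming the same shift-then-count shape. The function name is hypothetical, and whether a particular compiler turns this into the shr / imul pair shown above is an assumption rather than a guarantee.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical multiply variant: the top bit (0 or 1) scales the count.
int popcount_subset_mul(std::bitset<64> bits, int pos) {
    assert(pos >= 0 && pos < 64);

    // Put bits[pos] in bit 63 and shift out everything above it.
    const std::uint64_t shifted = bits.to_ullong() << ((63 - pos) & 63);

    // 1 if the original bits[pos] was set, otherwise 0.
    const int top = static_cast<int>(shifted >> 63);

    // Multiplying by 0 or 1 replaces the conditional.
    return top * static_cast<int>(std::bitset<64>(shifted).count());
}
```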


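Haswell performance (mul version):

mov r,r: 1 fused-domain uop, 0 latency, no execution unit
xor-zeroing: 1 fused-domain uop, no execution unit
not: 1 uop for p0/p1/p5/p6, 1c latency, one per 0.25c throughput
shl (aka sal) with count in cl: 3 uops for p0/p6, 2c latency, one per 2c throughput. (Oddly, Agner Fog's data indicates that IvyBridge only takes 2 uops for this.)
popcnt: 1 uop for p1, 3c latency, one per 1c throughput
shr r,imm: 1 uop for p0/p6, 1c latency, one per 0.5c throughput
imul r,r: 1 uop for p1, 3c latency
(not counting the ret)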

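Totals:

9 fused-domain uops, which can issue in 2.25 cycles (in theory; uop cache-line effects usually bottleneck the front end slightly).

4 uops (the shifts) for p0/p6, 2 uops for p1, and 1 uop for any ALU port. These can execute at one per 2c (saturating the shift ports), so the front end is the worst bottleneck.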

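Latency: the critical path from when the bitset is ready to when the result is ready is shl (2) -> popcnt (3) -> imul (3), for a total of 8 cycles, or 9c from when pos is ready, because the not adds an extra 1c of latency.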
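The arithmetic behind these figures can be written out as a small sketch. It assumes an issue width of 4 fused-domain uops per cycle for Haswell, which is what makes 9 uops come to 2.25 cycles. The function names are illustrative only.

```cpp
#include <cassert>

// Assumption: Haswell issues up to 4 fused-domain uops per cycle.
const int kIssueWidth = 4;

// Front-end cycles needed to issue `uops` fused-domain uops.
double issue_cycles(int uops) {
    assert(uops >= 0);
    return static_cast<double>(uops) / kIssueWidth;
}

// Critical path of the mul version, starting from the bitset being ready:
// shl (2c) -> popcnt (3c) -> imul (3c).
int mul_latency_from_bitset() { return 2 + 3 + 3; }

// Starting from pos being ready, the not (1c) feeds the shift count first.
int mul_latency_from_pos() { return 1 + mul_latency_from_bitset(); }
```

For the 9-uop mul version, issue_cycles(9) gives 2.25, which is above the 2c limit imposed by the shift ports, so the front end is the tighter limit.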
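The optimal bit-broadcast version replaces shr with sar (same performance) and imul with and (1c latency instead of 3c, and it runs on any port). So the only performance change is that the critical-path latency drops to 6 cycles. Throughput is still bottlenecked on the front end. The fact that and can run on any port makes no difference unless you are mixing this with code that bottlenecks on port 1 (as opposed to looking at the throughput of running just this code in a tight loop).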
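A sketch of the bit-broadcast idea in portable C++. Instead of an arithmetic right shift on a signed value, it subtracts the top bit from zero in unsigned arithmetic, which also yields 0 or all-ones. The function name is hypothetical, and whether a compiler emits sar / and for this exact source is an assumption.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Hypothetical bit-broadcast variant: AND the count with 0 or all-ones.
int popcount_subset_broadcast(std::bitset<64> bits, int pos) {
    assert(pos >= 0 && pos < 64);

    // Put bits[pos] in bit 63 and shift out everything above it.
    const std::uint64_t shifted = bits.to_ullong() << ((63 - pos) & 63);

    // 0 - 0 is 0 and 0 - 1 wraps to all-ones in unsigned arithmetic,
    // so the top bit ends up copied into every bit of the mask.
    const std::uint64_t mask = std::uint64_t{0} - (shifted >> 63);

    // The AND keeps the count when the bit was set and zeroes it otherwise.
    return static_cast<int>(std::bitset<64>(shifted).count() & mask);
}
```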
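cmov (ternary-operator) version: 11 fused-domain uops (front end: one per 2.75c). Execution units: still bottlenecked on the shift ports (p0/p6) at one per 2c. Latency: 7c from the bitset to the result.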
; the original ternary-operator version.  See below for the optimal version we can coax gcc into emitting.
popcount_subset(std::bitset<64ul>, int):
    ; input bitset in rdi, input count in esi (SysV ABI)
    mov     ecx, esi    ; x86 variable-count shift requires the count in cl
    xor     edx, edx    ; edx=0 
    xor     eax, eax    ; gcc's workaround for popcnt's false dependency on the old value of dest, on Intel
    not     ecx         ; two's complement bithack for 63-pos (in the low bits of the register)
    sal     rdi, cl     ; rdi << ((63-pos) & 63);  same insn as shl (arithmetic == logical left shift)
    popcnt  rdx, rdi
    test    rdi, rdi    ; sets SF if the high bit is set.
    cmovs   rax, rdx    ; conditional-move on the sign flag
    ret


popcount_subset(std::bitset<64ul>, int):
    mov     ecx, 63
    sub     ecx, esi      ; larger code size, but faster on CPUs without mov-elimination
    shl     rdi, cl       ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi      ; doesn't start a fresh dep chain before this, like gcc does
    sar     rdi, 63       ; broadcast the sign bit
    and     eax, edi      ; eax = 0 or its previous value
    ret

// hand-tuned BMI2 version using the NOT trick and the bitbroadcast
popcount_subset(std::bitset<64ul>, int):
    not     esi           ; The low 6 bits hold 63-pos.  gcc's two's complement trick
    xor     eax, eax      ; break false dependency on Intel.  maybe not needed when inlined.
    shlx    rdi, rdi, rsi ; rdi << ((63-pos) & 63)
    popcnt  rax, rdi
    sar     rdi, 63       ; broadcast the sign bit: rdi=0 or -1
    and     eax, edi      ; eax = 0 or its previous value
    ret

   // hand-tuned, not compiler output
        mov       ecx, esi    ; ICC uses neg/add/mov :/
        not       ecx
        xor       eax, eax    ; breaks the false dep, or is the return value in the taken-branch case
        shl       rdi, cl
        jns    .bit_not_set
        popcnt    rax, rdi
.bit_not_set:
        ret
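One way to gain confidence in any of these variants is to compare it against the brute-force loop from the question at every bit position. The sketch below does that for a branchy shift-and-count function that mirrors the shl / jns / popcnt shape of the last listing. All three function names are hypothetical.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Reference: the brute-force loop from the question.
int countupto_reference(const std::bitset<64>& bits, int pos) {
    assert(pos >= 0 && pos < 64);
    if (!bits[pos]) return 0;
    int total = 1;
    for (int i = 0; i < pos; ++i) total += bits[i];
    return total;
}

// Branchy shift-and-count: skip the popcount when the tested bit is clear.
int popcount_subset_branch(std::bitset<64> bits, int pos) {
    assert(pos >= 0 && pos < 64);
    bits <<= (63 - pos) & 63;          // bits[pos] moves to bit 63
    if (!bits[63]) return 0;           // corresponds to the jns in the listing
    return static_cast<int>(bits.count());
}

// Returns true if both functions agree for `value` at every position 0..63.
bool agrees_with_reference(std::uint64_t value) {
    const std::bitset<64> bits(value);
    for (int pos = 0; pos < 64; ++pos) {
        if (popcount_subset_branch(bits, pos) != countupto_reference(bits, pos)) {
            return false;
        }
    }
    return true;
}
```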