C++: Fastest way to make a mask with n ones starting at position i

Tags: c++, optimization, bit-manipulation, bitmask


What is the fastest way (in terms of CPU cycles on common modern architectures) to generate a mask with len bits set to 1, starting at position pos:

template <class UIntType>
constexpr UIntType make_mask(std::size_t pos, std::size_t len)
{
    // Body of the function
}

// Call of the function
auto mask = make_mask<uint32_t>(4, 10);
// mask = 00000000 00000000 00111111 11110000 
// (in binary with MSB on the left and LSB on the right)


Also, are there any compiler intrinsics or functions that can help?

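If "starting at pos" means that the lowest-order bit of the mask is at the position corresponding to 2^pos (as in your example), build a run of len ones and shift it left by pos, guarding the shift with a ternary test if len can reach the width of the type.

(If pos can also be ≥ std::numeric_limits<UIntType>::digits, you will need another ternary-operator test.)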
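The guarded form described above could be sketched as follows. This is a minimal sketch, not the answerer's exact code: the name make_mask_checked is hypothetical, and it assumes an unsigned UIntType, len no larger than the type width, and pos smaller than the type width.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical name; assumes len <= digits and pos < digits.
template <class UIntType>
constexpr UIntType make_mask_checked(std::size_t pos, std::size_t len)
{
    static_assert(std::is_unsigned<UIntType>::value, "UIntType must be unsigned");
    // The ternary replaces the out-of-range shift with 0, so that 0 - 1 wraps to all ones.
    // The inner cast truncates back to UIntType after integer promotion of narrow types.
    return static_cast<UIntType>(
        static_cast<UIntType>(
            (len < static_cast<std::size_t>(std::numeric_limits<UIntType>::digits)
                 ? (UIntType(1) << len) : UIntType(0)) - UIntType(1))
        << pos);
}
```

The test on len costs a compare and a select when len is a runtime value; when both arguments are constants the whole expression folds away.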

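You could also use:

((UIntType(1) << (len>>1) << ((len+1)>>1)) - UIntType(1)) << pos

Splitting the shift into two halves keeps each shift count below the type width even when len equals it, so no ternary test on len is needed. Note the extra parentheses around the shifts: binary minus binds tighter than <<, so without them the subtraction would apply to the shift count.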
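As a function, the split-shift idea might look like the sketch below. The name make_mask_split is hypothetical, and the sketch assumes an unsigned UIntType, len no larger than the type width, and pos smaller than the type width.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical name; assumes len <= digits and pos < digits.
template <class UIntType>
constexpr UIntType make_mask_split(std::size_t pos, std::size_t len)
{
    static_assert(std::is_unsigned<UIntType>::value, "UIntType must be unsigned");
    // Shift by floor(len/2), then by ceil(len/2): the total is len, but each
    // individual count stays below the type width, even for len == width.
    return static_cast<UIntType>(
        static_cast<UIntType>(
            (UIntType(1) << (len >> 1) << ((len + 1) >> 1)) - UIntType(1))
        << pos);
}
```

Whether the extra shifts beat a compare-and-select is something only a benchmark on the target CPU can settle.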

Fastest way? I would use something like this:
template <class T>
constexpr T make_mask(std::size_t pos, std::size_t len)
{
  return ((static_cast<T>(1) << len)-1) << pos;
}

Maybe use a table? For type uint32_t, you can write:
static uint32_t masks[] = { 0x0, 0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f...}; // only 32 such masks
return masks[len] << pos;
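A comment further down the thread points out that covering len = 0..32 takes 33 table entries. A minimal sketch of such a table, built at compile time, is shown here; the names LowMaskTable, build_low_masks, low_masks and make_mask_lut are hypothetical, and the sketch assumes C++14 or later.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical wrapper so a constexpr function can return a built-in array.
struct LowMaskTable {
    std::uint32_t v[33];  // one entry per len = 0..32
};

constexpr LowMaskTable build_low_masks()
{
    LowMaskTable t{};
    std::uint32_t m = 0;
    for (std::size_t len = 0; len <= 32; ++len) {
        t.v[len] = m;         // low len bits set
        m = (m << 1) | 1u;    // extend the run of ones by one bit
    }
    return t;
}

constexpr LowMaskTable low_masks = build_low_masks();

// No range check: the caller must guarantee len <= 32 and pos < 32.
constexpr std::uint32_t make_mask_lut(std::size_t pos, std::size_t len)
{
    return low_masks.v[len] << pos;
}
```

With runtime arguments this trades the shift-and-subtract for a memory load, so it only pays off if the table stays hot in cache.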

Speed is irrelevant here, because the expression is constant: it is precomputed by the optimizer and most likely used as an immediate operand. Whatever you use, it will cost you 0 cycles.
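To illustrate the constant-folding point, the sketch below restates the shift-and-subtract function from the thread and forces compile-time evaluation. Storing the result in a constexpr variable is an illustration added here, not something the answer itself shows.

```cpp
#include <cstddef>
#include <cstdint>

// The shift-and-subtract version from the thread, restated so the example is self-contained.
// Valid for len below the type width and pos below the type width.
template <class T>
constexpr T make_mask(std::size_t pos, std::size_t len)
{
    return ((static_cast<T>(1) << len) - 1) << pos;
}

// A constexpr variable must be initialized by a constant expression,
// so the mask is computed by the compiler, not at run time.
constexpr auto mask = make_mask<std::uint32_t>(4, 10);
static_assert(mask == 0x3FF0u, "bits 4..13 set");
```

This guarantee only holds when both arguments are themselves constant expressions; with runtime values the function compiles to ordinary shift instructions.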
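The biggest issue here is the range of possible inputs. In C, shifting by the type width or more is undefined behavior. However, it looks like len can meaningfully range from 0 to the type width, e.g. 33 different lengths for uint32_t. With pos=0, we get masks from 0 to 0xFFFFFFFF. (I'm going to assume 32 bits for clarity in English and asm, but use generic C++.)

If we can exclude either end of that range as a possible input, then there are only 32 possible lengths, and we can use a left or right shift as a building block. (Use an assert() to validate the input range in debug builds.)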
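A minimal sketch of the assert() suggestion, applied to the left-shift building block. The name make_mask_asserted is hypothetical, and the sketch assumes C++14 or later so that the constexpr function may contain several statements.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical name. The checks vanish when NDEBUG is defined (release builds).
template <class T>
constexpr T make_mask_asserted(std::size_t pos, std::size_t len)
{
    // Left-shift building block: len == width is excluded, len == 0 is allowed.
    assert(len < static_cast<std::size_t>(std::numeric_limits<T>::digits));
    assert(pos < static_cast<std::size_t>(std::numeric_limits<T>::digits));
    return static_cast<T>(((T(1) << len) - T(1)) << pos);
}
```

For the right-shift building block the first check would instead require len to be at least 1 and at most the type width.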

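I put together several versions of the function (from other answers), with some macros that compile them with a constant len, a constant pos, or both inputs variable. Some do better than others. KIIV's looks good within its valid range (len=0..31, pos=0..31).

This version works for len=1..32 and pos=0..31. It generates slightly worse x86-64 asm than KIIV's, so use KIIV's if it works for you without extra checks.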
// right-shift a register of all-ones, then shift it into position.
// works for len=1..32 and pos=0..31
template <class T>
constexpr T make_mask_PJC(std::size_t pos, std::size_t len)
{
//  T all_ones = -1LL;
//  unsigned typebits = sizeof(T)*CHAR_BIT;  // std::numeric_limits<T>::digits
//  T len_ones = all_ones >> (typebits - len);
//  return len_ones << pos

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}

// Same idea, but mask the shift count the same way x86 shift instructions do, so the compiler can do it for free.
// Doesn't always compile to ideal code with SHRX (BMI2), maybe gcc only knows about letting the shift instruction do the masking for the older SHR / SHL instructions
uint32_t make_mask_PJC_noUB(std::size_t pos, std::size_t len)
{
  using T=uint32_t;

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");

  T all_ones = -1LL;
  unsigned typebits = std::numeric_limits<T>::digits;
  T len_ones = all_ones >> ( (typebits - len) & (typebits-1));     // the AND optimizes away
  return len_ones << (pos & (typebits-1));

//  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}
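One way to gain confidence in the stated ranges is to compare a variant against a bit-by-bit reference over every (pos, len) pair. The sketch below does this for the right-shift idea restated for uint32_t; the names make_mask_ref, make_mask_rshift and rshift_matches_ref are hypothetical, and C++14 or later is assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Reference: set bits pos .. pos+len-1 one at a time, dropping bits past bit 31.
// No shift count ever reaches 32, so there is no undefined behavior.
constexpr std::uint32_t make_mask_ref(std::size_t pos, std::size_t len)
{
    std::uint32_t m = 0;
    for (std::size_t i = 0; i < len && pos + i < 32; ++i)
        m |= std::uint32_t(1) << (pos + i);
    return m;
}

// Right-shift variant restated for uint32_t (valid for len=1..32, pos=0..31).
constexpr std::uint32_t make_mask_rshift(std::size_t pos, std::size_t len)
{
    return (~std::uint32_t(0) >> (32 - len)) << pos;
}

// True if the two agree on the whole documented range.
constexpr bool rshift_matches_ref()
{
    for (std::size_t pos = 0; pos < 32; ++pos)
        for (std::size_t len = 1; len <= 32; ++len)
            if (make_mask_rshift(pos, len) != make_mask_ref(pos, len))
                return false;
    return true;
}
```

Because the check is constexpr, it can run inside a static_assert, where any undefined shift in the tested range would also be rejected by the compiler.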

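Does this have to cover the case where len equals the number of bits in the type? That adds extra complexity.

It would be good if the function worked for len = the number of bits in the type (in which case all bits from pos upward are set to 1).

If len is equal to or greater than the number of bits in the type, then if you use (1 ...

@dwelch: If int is 32 bits, 1U ...

Minor suggestion: changing (T)1 to T(1) would make the parentheses less LISP-like and perhaps easier to read. (Personally I would use static_cast, but a C-style cast is fine here too.)

I put it through a compiler. This answer compiles to very efficient code; it looks better than starting with -1LL and shifting it. BMI2's shlx instruction makes it very efficient (since it is much faster than the regular shl on Intel Haswell/BDW/Skylake, although even a regular variable-count shl r32, cl has only 2 cycles of latency (and 3 uops)).

If len can be 32 (or whatever the type width is) but cannot be 0, then you should start from static_cast<T>(-1LL). If len can be 0 but not 32, then this answer is ideal. If len can be 0 or 32 and both need to work, then you need something fancier than either of these solutions. (Jean-Baptiste's lookup table can work, if you are sure len does not need range checking. It needs a LUT with 33 entries, from 0 to 0xFFFFFFFF.)

+1. BZI is probably the fastest because you do not need a memory access for the table, but assuming this happens in a tight loop (if not, why optimize it?), a table access is probably just as good. Even a complete 2D table indexed by (pos, len) is conceivable, with 64² = 4096 entries (half of them useless, unless you want to play with triangular indexing).

@YvesDaoust: That sounds like a terrible idea. It is unlikely to stay hot in L1. Even doing part of the work with a table lookup sounds dicey, unless your code has so much instruction-level parallelism that reducing uops at the cost of more latency is worthwhile. L1 load-use latency is about 4 cycles on recent Intel CPUs, but I think KIIV's function can ...

@PeterCordes: I know, but the question is about fixed len/pos values, so you can expect the same values to be reused forever. Anyway, this whole discussion is unnecessary; see my answer.

@YvesDaoust: Yes, I saw your answer and upvoted it. I assume the other answers are trying to be useful for the case where len/pos are not compile-time constants, on the assumption that the OP only used constants to simplify the example.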
// Full-range variant: special-cases len == 0, so it works for len=0..32 and pos=0..31
uint32_t make_mask_fullrange(std::size_t pos, std::size_t len)
{
  using T=uint32_t;

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");

  T all_ones = -1LL;
  unsigned typebits = std::numeric_limits<T>::digits;
  //T len_ones = all_ones >> ( (typebits - len) & (typebits-1));
  T len_ones = len==0 ? 0 : all_ones >> ( (typebits - len) & (typebits-1));
  return len_ones << (pos & (typebits-1));

//  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}
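The same full-range approach can be written generically over the mask type. The sketch below is one possible generalization, not code from the thread: the name make_mask_fullrange_generic is hypothetical, and it assumes an unsigned T, len between 0 and the type width, and C++14 or later.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical generic form of the full-range function; len may be 0..digits.
template <class T>
constexpr T make_mask_fullrange_generic(std::size_t pos, std::size_t len)
{
    static_assert(std::is_unsigned<T>::value, "T must be unsigned");
    constexpr unsigned typebits = std::numeric_limits<T>::digits;
    const T all_ones = static_cast<T>(~T(0));
    // len == 0 is special-cased because the masked count would be 0 there,
    // which would leave all ones instead of an empty mask.
    const T len_ones = (len == 0)
        ? T(0)
        : static_cast<T>(all_ones >> ((typebits - len) & (typebits - 1)));
    // Masking pos keeps the shift count in range, as the x86 shift instructions do.
    return static_cast<T>(len_ones << (pos & (typebits - 1)));
}
```

The casts matter for types narrower than int, where integer promotion would otherwise leave high bits set in the intermediate results.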