Very fast way to check a set bit in C

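I use a kind of bitstream in my code that has a read_bit() function. This function is called very often (more than one billion times in a single stream). This is what the struct BitStream looks like: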

typedef struct BitStream {
    unsigned char* data;
    unsigned int size;
    unsigned int currentByte;
    unsigned char buffer;
    unsigned char bitsInBuffer;
} BitStream;

And the read_bit() function is defined as follows:

unsigned char bitstream_read_bit(BitStream* stream, unsigned long long bitPos) {
    unsigned int byte = bitPos / 8;
    unsigned char byteVal = stream->data[byte];
    unsigned char mask = 128 >> (bitPos & 7);
    if (mask & byteVal) {
        return 1;
    } else {
        return 0;
    }
}

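Now, I found out by trial and error that the line

unsigned char mask = 128 >> (bitPos & 7);

is very slow. Is there some way I can speed up the check of a bit? I have already tried using an array that indexes the 8 possible masks, but this is not faster (I think due to the memory access).

EDIT: Over the last week I tried a lot of the answers and ran a lot of benchmarks, but there was not much performance improvement. I eventually managed to gain 10 seconds by reversing the order of the bits in the bitstream. So instead of using the mask 128 >> (bitPos & 7), I used the following function:
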
unsigned char bitstream_read_bit_2(BitStream* stream, const unsigned long long bitPos) {
    unsigned int byte = (unsigned int) (bitPos / 8);
    unsigned char byteVal = stream->data[byte];
    unsigned char mod = bitPos & 7;
    return (byteVal & (1 << mod)) >> mod;
}

Obviously, I also changed the corresponding write function.
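The write function itself is not shown in the question. As a minimal sketch of what a matching LSB-first writer could look like, assuming the same BitStream layout: the name bitstream_write_bit_2 and its value parameter are hypothetical, and the bounds check via assert is an addition for illustration only.

```c
#include <assert.h>

typedef struct BitStream {
    unsigned char* data;
    unsigned int size;
    unsigned int currentByte;
    unsigned char buffer;
    unsigned char bitsInBuffer;
} BitStream;

/* Hypothetical writer (not from the original post): stores `value` at
   bit position bitPos, least significant bit first within each byte. */
static void bitstream_write_bit_2(BitStream* stream, unsigned long long bitPos,
                                  unsigned char value) {
    unsigned int byte = (unsigned int)(bitPos / 8);
    unsigned char mask = (unsigned char)(1u << (bitPos & 7));
    assert(byte < stream->size);  /* illustration only; remove in hot paths */
    if (value) {
        stream->data[byte] |= mask;                  /* set the bit */
    } else {
        stream->data[byte] &= (unsigned char)~mask;  /* clear the bit */
    }
}

/* Reader from the question, unchanged in behavior. */
static unsigned char bitstream_read_bit_2(const BitStream* stream,
                                          unsigned long long bitPos) {
    unsigned int byte = (unsigned int)(bitPos / 8);
    unsigned char byteVal = stream->data[byte];
    unsigned char mod = (unsigned char)(bitPos & 7);
    return (unsigned char)((byteVal & (1 << mod)) >> mod);
}
```

The only requirement is that reader and writer agree on the bit order; a stream written MSB-first cannot be read with the LSB-first reader without reversing the bits in each byte.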

Here is how I would initially optimize your code:
unsigned char bitstream_read_bit(BitStream* stream, unsigned long long bitPos) 
{
    return !!(stream->data[(bitPos / 8)] & (128 >> (bitPos % 8)));
}

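But the function call overhead itself is probably more expensive than the bit-tweaking code inside it. So if you really want to optimize further, let's take advantage of inlining and convert it to a macro: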
#define bitstream_read_bit(stream, bitPos) (!!((stream)->data[((bitPos) / 8)] & (128 >> ((bitPos) % 8))))

The first obvious improvement is to shift the loaded value rather than the mask:
unsigned char bitstream_read_bit(BitStream* stream, unsigned long long bitPos) {
    unsigned int byte = bitPos / 8;
    unsigned char byteVal = stream->data[byte];
    unsigned char maskVal = byteVal >> (bitPos & 7);
    return maskVal & 1;
}

This removes the need for a conditional (no if or ! or ?: needed).
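One thing to watch: shifting the byte right by bitPos & 7 reads bits least significant first, whereas the mask 128 >> (bitPos & 7) in the question reads them most significant first. If the existing stream layout has to be kept, the shift amount can be flipped. The sketch below assumes the question's BitStream struct; the function names read_bit_mask and read_bit_shift_msb are made up for this comparison.

```c
typedef struct BitStream {
    unsigned char* data;
    unsigned int size;
    unsigned int currentByte;
    unsigned char buffer;
    unsigned char bitsInBuffer;
} BitStream;

/* The question's reader (MSB-first), kept here only for comparison. */
static unsigned char read_bit_mask(const BitStream* stream,
                                   unsigned long long bitPos) {
    unsigned char byteVal = stream->data[bitPos / 8];
    unsigned char mask = (unsigned char)(128 >> (bitPos & 7));
    return (mask & byteVal) ? 1 : 0;
}

/* Hypothetical branch-free variant that keeps MSB-first order:
   bit 0 of the stream is bit 7 of the byte, so shift by 7 - (bitPos & 7). */
static unsigned char read_bit_shift_msb(const BitStream* stream,
                                        unsigned long long bitPos) {
    unsigned char byteVal = stream->data[bitPos / 8];
    return (unsigned char)((byteVal >> (7 - (bitPos & 7))) & 1u);
}
```

Both functions return the same value for every position, so the second can replace the first without touching the writer. Whether it is measurably faster depends on the compiler and target and would need benchmarking.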

If you can modify the struct, I would suggest accessing the data in units larger than a byte:

#include <stddef.h>
#include <limits.h>
#include <stdbool.h>

typedef struct WBitStream
{
  size_t *data;
  size_t size;
} WBitStream;

bool Wbitstream_read_bit(WBitStream* stream, size_t bitPos)
{
  size_t location = bitPos / (sizeof(size_t)*CHAR_BIT);
  size_t locval = stream->data[location];
  size_t maskval = locval >> (bitPos & (sizeof(size_t)*CHAR_BIT-1));
  return maskval & 1;
}

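On some processors (notably the common x86), masking the shift amount is a no-op, because the processor's native shift instruction only considers the low bits of the shift amount anyway. At least gcc knows about this.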

I have tested the optimized macro against your initial source code:

static unsigned char tMask[8] = { 128, 64, 32, 16, 8, 4, 2, 1 };

#define BITSTREAM_READ_BIT1(stream, bitPos) (((128 >> ((bitPos) & 7)) & (stream)->data[(bitPos) >> 3]) != 0)
#define BITSTREAM_READ_BIT2(stream, bitPos) ((tMask[(bitPos) & 7] & (stream)->data[(bitPos) >> 3]) != 0)

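Replacing the mask computation with a lookup in an array does not improve performance. The main gap is between the function and the macro (the macro is 6 times faster on my computer with 80,000,000 calls).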

Using static inline comes close to the macro.
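For reference, a static inline version could look roughly like the sketch below. It assumes the question's BitStream struct and a C99 compiler; the names bitstream_read_bit_inline and count_set_bits are illustrative and do not come from the benchmark above.

```c
typedef struct BitStream {
    unsigned char* data;
    unsigned int size;
    unsigned int currentByte;
    unsigned char buffer;
    unsigned char bitsInBuffer;
} BitStream;

/* Typically placed in a header so each translation unit sees the body
   and the compiler is free to inline it at the call site. */
static inline unsigned char bitstream_read_bit_inline(const BitStream* stream,
                                                      unsigned long long bitPos) {
    return (unsigned char)((stream->data[bitPos >> 3] & (128u >> (bitPos & 7))) != 0);
}

/* Hypothetical caller: counts the set bits among the first nbits bits. */
static unsigned long long count_set_bits(const BitStream* stream,
                                         unsigned long long nbits) {
    unsigned long long count = 0;
    for (unsigned long long i = 0; i < nbits; ++i) {
        count += bitstream_read_bit_inline(stream, i);
    }
    return count;
}
```

Unlike the macro, this keeps type checking and evaluates bitPos only once. Note that inline is only a hint: whether the call is actually inlined depends on the compiler and the optimization level in use.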

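How slow is it at the moment? What would be acceptably slow (but faster than it is currently)? How much memory can you throw at this? Can you include the disassembly of the current implementation?

This also saves time: return ((mask & byteVal) != 0);. Possibly also replacing bitPos / 8 with bitPos >> 3.

Which optimization options are you using?

What is the full scenario? How often do you spend these 28 seconds, and why does it matter to reduce that to 23?

If you call the function 1e+9 times, you probably do so sequentially. You should use that to your advantage.

Instead of a mask array, a long switch might be more efficient.

Can you read more than one bit in a single call? For example, unpack an entire byte into an 8-byte vector. I think that should be possible with a single multiplication.

Optimizing this code will be very difficult because everything it does is very cheap. So maybe optimize at a higher level instead.

It won't matter. The function call overhead far outweighs the cost of doing inefficient bit-tweaking operations. But that doesn't mean we can't combine the two solutions.

Or by prefixing the function with static inline?

@Mike Last I checked, gcc is able to optimise / and % by power-of-two immediates. They should not be slower than the equivalent bit operations.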
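The suggestion about sequential access can be made concrete. The sketch below assumes that the buffer, bitsInBuffer and currentByte fields of the question's struct are meant for exactly this kind of use, which the question does not confirm, and bitstream_next_bit is a hypothetical name. It loads each byte once and then hands out its bits MSB-first, so there is no division or variable shift per bit.

```c
typedef struct BitStream {
    unsigned char* data;
    unsigned int size;
    unsigned int currentByte;
    unsigned char buffer;
    unsigned char bitsInBuffer;
} BitStream;

/* Hypothetical sequential reader: returns the next bit, MSB-first.
   The caller must not request more than size * 8 bits in total. */
static inline unsigned char bitstream_next_bit(BitStream* stream) {
    if (stream->bitsInBuffer == 0) {
        /* Refill: fetch the next byte from memory once per 8 bits. */
        stream->buffer = stream->data[stream->currentByte++];
        stream->bitsInBuffer = 8;
    }
    unsigned char bit = (unsigned char)(stream->buffer >> 7);  /* top bit */
    stream->buffer = (unsigned char)(stream->buffer << 1);     /* drop it */
    stream->bitsInBuffer--;
    return bit;
}
```

This trades random access for a refill branch that is taken once every eight calls. Whether that beats the position-based readers above would have to be measured on the real workload.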