C++: Converting a byte array (uint8_t) to a word array (uint16_t) and vice versa

c++ performance image-processing visual-c++

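I have a very time-consuming piece of code that I need to optimize. It converts an array of bytes to an array of words and vice versa. The operation is used to convert between 8-bit and 16-bit image data.

The array is qword-aligned and large enough to hold the result.

Converting from byte to word requires a multiplication by 257 (so 0 converts to 0 and 255 becomes 65535).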
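The scaling rule can be checked in isolation. The following is a minimal sketch, not code from the original post; the helper names `scaleByMultiply`, `scaleByReplicate`, and `formulationsAgree` are made up for illustration. It shows that multiplying by 257 is the same as placing the byte in both halves of the 16-bit word.

```cpp
#include <cstdint>

// Scale an 8-bit sample to 16 bits by multiplying with 257 (0x101).
constexpr std::uint16_t scaleByMultiply(std::uint8_t b)
{
    return static_cast<std::uint16_t>(b * 0x101);
}

// Same result: place the byte in both the high and the low half of the word.
constexpr std::uint16_t scaleByReplicate(std::uint8_t b)
{
    return static_cast<std::uint16_t>((b << 8) | b);
}

// Returns true if both formulations agree for every possible byte value.
inline bool formulationsAgree()
{
    for (int b = 0; b <= 255; ++b)
    {
        const auto byte = static_cast<std::uint8_t>(b);
        if (scaleByMultiply(byte) != scaleByReplicate(byte))
            return false;
    }
    return true;
}

// The two endpoints named in the question.
static_assert(scaleByMultiply(0) == 0, "0 maps to 0");
static_assert(scaleByMultiply(255) == 65535, "255 maps to 65535");
```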

A simple solution might be:

void simpleBytesToWords(void *ptr, int pixelCount)
{
    for (int i = pixelCount - 1; i >= 0; --i)
        reinterpret_cast<uint16_t*>(ptr)[i] = reinterpret_cast<uint8_t*>(ptr)[i] * 0x101;
}
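The question also covers the reverse direction but shows no code for it. The following is a minimal sketch under the assumption that a 16-bit sample is narrowed by keeping its high byte, which exactly undoes the multiplication by 257. The name `simpleWordsToBytes` is hypothetical and does not come from the original post. When working in place, the narrowing direction has to iterate forward, the opposite of the widening loop above.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical counterpart to simpleBytesToWords(): narrows pixelCount 16-bit
// samples at ptr to 8-bit samples stored at the start of the same buffer.
void simpleWordsToBytes(void *ptr, std::size_t pixelCount)
{
    const auto *source = static_cast<const std::uint16_t*>(ptr);
    auto *target = static_cast<std::uint8_t*>(ptr);

    // Forward iteration: byte i is stored at or before the position of word i,
    // so every word is read before anything overwrites it.
    for (std::size_t i = 0; i < pixelCount; ++i)
        target[i] = static_cast<std::uint8_t>(source[i] >> 8);
}
```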

I also tried to improve performance by making use of 64-bit registers, converting 4 bytes at a time:

void bytesToWords(void *ptr, int pixelCount)
{
    const auto fastCount = pixelCount / 4;

    if (fastCount > 0)
    {
        for (int f = fastCount-1; f >= 0; --f)
        {
            auto bytes = uint64_t{ reinterpret_cast<const uint32_t*>(ptr)[f] };

            auto r2 = uint64_t{ bytes & 0xFF };
            bytes <<= 8;
            r2 |= bytes & 0xFF0000;
            bytes <<= 8;
            r2 |= bytes & 0xFF00000000ull;
            bytes <<= 8;
            r2 |= bytes & 0xFF000000000000ull;

            r2 *= 0x101;

            reinterpret_cast<uint64_t*>(ptr)[f] = r2; 
        }
    }

    if (pixelCount % 4)
    {
        auto source = reinterpret_cast<const uint8_t*>(ptr);
        auto target = reinterpret_cast<uint16_t*>(ptr);

        for (int i = fastCount * 4; i < pixelCount; ++i)
        {
            target[i] = (source[i] << 8) | source[i];
        }
    }

}
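One detail in the grouped version above looks fragile when the conversion runs in place: the leftover pixels are converted after the grouped loop, but their source bytes (from offset `fastCount * 4` onward) lie inside the region that loop has already written. The sketch below is one possible arrangement, not the author's code. It handles the tail first, iterates back to front throughout, and uses `std::memcpy` for the wide loads and stores. The function name is made up, and a little-endian target is assumed, as in the original.

```cpp
#include <cstdint>
#include <cstring>

// Sketch of an in-place grouped conversion. Assumes a little-endian target
// and a buffer with room for pixelCount 16-bit values.
void groupedBytesToWordsTailFirst(void *ptr, int pixelCount)
{
    auto *base = static_cast<unsigned char*>(ptr);
    const int fastCount = pixelCount / 4;

    // Tail first, back to front: these source bytes sit inside the region the
    // grouped loop writes, so they must be consumed before that loop runs.
    for (int i = pixelCount - 1; i >= fastCount * 4; --i)
    {
        const std::uint16_t word = static_cast<std::uint16_t>(base[i] * 0x101);
        std::memcpy(base + 2 * i, &word, sizeof word);
    }

    // Groups of four bytes, also back to front.
    for (int f = fastCount - 1; f >= 0; --f)
    {
        std::uint32_t packed;
        std::memcpy(&packed, base + 4 * f, sizeof packed);

        // Spread the four bytes into the low byte of each 16-bit lane.
        std::uint64_t lanes = packed;
        lanes = (lanes | (lanes << 16)) & 0x0000FFFF0000FFFFull;
        lanes = (lanes | (lanes << 8)) & 0x00FF00FF00FF00FFull;

        lanes *= 0x101; // each lane becomes byte * 257; no carry between lanes
        std::memcpy(base + 8 * f, &lanes, sizeof lanes);
    }
}
```

Comparing its output against `simpleBytesToWords()` for pixel counts that are not multiples of 4 is a quick way to check the tail handling.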

After getting the code to compile, I tried two things (I just renamed your bytesToWords() to groupedBytesToWords() below):

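Testing the two functions: they do not produce the same results. With simpleBytesToWords() I get a zero-filled array. With groupedBytesToWords() I get valid results alternating with zeros.

Without changing them, and assuming that fixing the bugs would not change their complexity, I tried a third function that I wrote. It uses a precalculated uint8_t -> uint16_t table, which has to be built first: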

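Here is the table. It is small, with one entry for each possible uint8_t value (256 entries, for 0 through 255):

// Build a precalculation table for each possible uint8_t -> uint16_t conversion
// max() is 255, so the table needs max() + 1 entries for index 255 to be valid
const size_t sizeTable(std::numeric_limits<uint8_t>::max() + 1);

uint16_t * precalc_table = new uint16_t[sizeTable];

for (uint16_t i = 0; i < sizeTable; ++i)
{
    precalc_table[i] = i * 0x101;
}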
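Since uint8_t ranges over 0 through 255, a complete lookup table has 256 entries. As an alternative to the manual `new[]`, the table could be held in a `std::array`, which fixes the size in the type and needs no `delete[]`. This is only a sketch; `buildWideningTable` is a hypothetical name, not part of the original answer.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// One entry for every possible uint8_t value: 0..255, i.e. 256 entries.
std::array<std::uint16_t, 256> buildWideningTable()
{
    std::array<std::uint16_t, 256> table{};
    for (std::size_t i = 0; i < table.size(); ++i)
        table[i] = static_cast<std::uint16_t>(i * 0x101);
    return table;
}
```

A caller would build the table once, for example `const auto table = buildWideningTable();`, and pass `table.data()` wherever a `uint16_t const *` table is expected.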

I then ran some comparisons on an array of 500000000 uint16_t elements, initially filled with random uint8_t values:
fillBuffer(buffer, sizeBuf);
begin = clock();
simpleBytesToWords(buffer, sizeBuf);
end = clock();
std::cout << "simpleBytesToWords(): " << (double(end - begin) / CLOCKS_PER_SEC) << std::endl;
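The same measurement pattern can be wrapped in a small helper so that each candidate is timed identically. The sketch below uses `std::chrono::steady_clock` instead of `clock()`. That is a substitution made here for illustration, not what the answer used, and `secondsTaken` is a hypothetical name.

```cpp
#include <chrono>
#include <utility>

// Runs work() once and returns the elapsed wall-clock time in seconds.
template <typename Work>
double secondsTaken(Work &&work)
{
    const auto begin = std::chrono::steady_clock::now();
    std::forward<Work>(work)();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - begin).count();
}
```

With the names from the snippet above, a call would look like `secondsTaken([&] { simpleBytesToWords(buffer, sizeBuf); })`.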

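Of course, this is not a real, valid benchmark, but it shows that your "grouped" function is slow on my machine, which is not consistent with the results you got. It also shows that precalculating the multiplication, rather than multiplying on the fly, helps a bit.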
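If you only need this on x86, you could consider using SIMD (SSE/AVX).

This doesn't work. You are overwriting the input data before reading it. It only works if you loop backward, starting from the end of the array. To speed up the code, don't multiply; just copy the byte into both the upper and the lower byte of the output word.

@Cris Luengo: You are right. I wrote this code directly on Stack Overflow because I only had the "optimized" version. I think it is fixed now. Copying actually takes longer than multiplying, because you either need [copy, shift, or] or you have to do separate memory writes.

Use bit shifts rather than division. The compiler cannot optimize that for you, because the two are not equivalent once integral promotion kicks in and converts both operands to signed int.

I am very sorry about the bug in simpleBytesToWords. I wrote it directly on Stack Overflow and did not iterate in the right direction X-(. It should be correct now.

@AndreasH. No problem. Iterating forward or backward has roughly the same complexity, so the execution-time comparison above still applies, and that was ultimately the question.

I have tested your table approach on my Pentium G4650 (home PC). It is a bit slower than simpleBytesToWords. The G4650 reaches a maximum of 2.4 pixels/second with any method. That is exactly the DDR clock. I will try again on my office PC, which is an i7-67 (Visual C++ optimization settings at maximum).

@AndreasH. It is interesting to see these differences depending on the test environment. I will withdraw this answer if it doesn't help. As Ben Voigt suggested, bit shifting is an interesting approach to explore.

Bit shifting instead of multiplying took my test code from 150% (compared with the reference implementation) to more than 800%. So a=(uint16_{b}

The table-based function from the answer, its call in the benchmark, and the benchmark output:
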
void hopefullyFastBytesToWords(uint16_t *ptr, size_t pixelCount, uint16_t const * precalc_table)
{
    for (size_t i = 0; i < pixelCount; ++i)
    {
        ptr[i] = precalc_table[ptr[i]];
    }
}

hopefullyFastBytesToWords(buffer, sizeBuf, precalc_table);

$ Sandbox.exe
simpleBytesToWords(): 0.681
groupedBytesToWords(): 1.2
hopefullyFastBytesToWords(): 0.461

$ Sandbox.exe
simpleBytesToWords(): 0.737
groupedBytesToWords(): 1.251
hopefullyFastBytesToWords(): 0.414

$ Sandbox.exe
simpleBytesToWords(): 0.582
groupedBytesToWords(): 1.173
hopefullyFastBytesToWords(): 0.436
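
One comment suggests SIMD on x86, and another suggests copying the byte into both halves of the word instead of multiplying. The two ideas combine naturally: interleaving a vector of bytes with itself duplicates every byte. The following is a minimal sketch of that idea using SSE2 intrinsics, not code from the thread. It is out of place (separate source and destination, which is an assumption that differs from the in-place functions above), the function name is made up, and it falls back to a scalar loop when SSE2 is not available.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define BYTES_TO_WORDS_HAVE_SSE2 1
#endif

// Out-of-place sketch: widens count bytes from src into dst (no overlap).
void bytesToWordsSse2(const std::uint8_t *src, std::uint16_t *dst, std::size_t count)
{
    std::size_t i = 0;
#ifdef BYTES_TO_WORDS_HAVE_SSE2
    for (; i + 16 <= count; i += 16)
    {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
        // Interleaving a vector with itself yields b0,b0,b1,b1,...: each
        // 16-bit lane then holds byte * 257.
        const __m128i lo = _mm_unpacklo_epi8(v, v);
        const __m128i hi = _mm_unpackhi_epi8(v, v);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), lo);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i + 8), hi);
    }
#endif
    // Scalar loop for the remainder (or for everything without SSE2).
    for (; i < count; ++i)
        dst[i] = static_cast<std::uint16_t>(src[i] * 0x101);
}
```

Whether this beats the scalar versions depends on the machine; one of the comments above reports a system where every method is limited by memory speed.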