C++ AMP: array_view and index behaving strangely (unexpected values stored)

c++ performance debugging c++-amp

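I have written a small test function that does not behave the way I want it to.

Basically, it should just read an array and write its contents back (later, once this works, it should do more, but for now even this fails).

When debugging into the GPU code, I see that the first few iterations (executed in parallel somehow, which probably makes sense for a GPU but surprised me while debugging) work fine. But then, after 1-2 debug continues (F5), some values that were previously set correctly are overwritten with 0. I really don't understand it. When I am back on the CPU, many values are 0 even though they should not be (basically, they should hold the original data, which is a simple test sequence).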

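#include "stdafx.h"
#include <amp.h>

typedef unsigned char byte;
using namespace concurrency;

void AMPChangeBrightnessContrastWrapper2(byte* a, int len, float brightness, float contrast)
{
    array_view<unsigned int> dst(len/4, (unsigned int*)a);
    //dst.discard_data();
    parallel_for_each(dst.extent, [=](index<1> idx) restrict(amp)
    {
        // split into bytes (stored as floats)
        float temp1 = (dst[idx]) - (dst[idx] >> 8) * 256;
        // this completely fails! float temp1 = dst[idx] & 0xFF;
        float temp2 = (dst[idx] >> 8) - (dst[idx] >> 16) * 256;
        float temp3 = (dst[idx] >> 16) - (dst[idx] >> 24) * 256;
        float temp4 = (dst[idx] >> 24);
        // convert back to the integer array
        dst[idx] = (int)(temp1 + temp2 * 256 + temp3 * 65536 + temp4 * 16777216);
    });
    //dst.synchronize();
}

int _tmain(int argc, _TCHAR* argv[])
{
    const int size = 30000;
    byte* a = new byte[size];
    // generate some unique test sequence; the first 99 numbers are just 0..98
    for (int i = 0; i ...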


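So the simple (planned) steps are:

Initialize the array
Pass the array to the GPU (as an unsigned int array)
Split each unsigned int into 4 bytes and store them in floats
(Do some calculations, omitted here for simplicity)
Join the bytes stored in the floats back together at the original position
(Repeat)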
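
The split and join steps above can be sketched on the CPU in plain C++. This is only an illustration, not the poster's kernel: `split_pixel` and `join_pixel` are hypothetical names, and byte 0 is assumed to be the lowest byte of each unsigned int.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Split a packed 32-bit pixel into its four bytes, stored as floats
// (index 0 is the lowest byte).
std::array<float, 4> split_pixel(std::uint32_t packed) {
    return {
        static_cast<float>(packed & 0xFFu),
        static_cast<float>((packed >> 8) & 0xFFu),
        static_cast<float>((packed >> 16) & 0xFFu),
        static_cast<float>(packed >> 24),
    };
}

// Join four channel values back into one 32-bit pixel.
std::uint32_t join_pixel(const std::array<float, 4>& channels) {
    std::uint32_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        assert(channels[i] >= 0.0f && channels[i] <= 255.0f);
        // Convert each channel to an integer before combining, so the
        // full 32-bit result is never held in a float.
        packed |= static_cast<std::uint32_t>(channels[i]) << (8 * i);
    }
    return packed;
}
```

With these helpers, `join_pixel(split_pixel(x))` returns `x` for any 32-bit value, which is the round trip the test function is meant to perform.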

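In case you are wondering: these are supposed to be color values.
The result is:

Some values are as expected, but most are different
In particular, byte 0 (of each unsigned int) seems to end up with a bad value
I first tried converting unsigned int -> byte -> float with & 0xFF, but that seemed to fail completely

The output is (but it should be increasing numbers starting from 0):
0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 16, 13, 14, 15, 0, 17, 18, 19, 32, 21, 22,
23, 32, 25, 26, 27, 32, 29, 30, 31, 0, 33, 34, 35, 64, 37, 38, 39, 64, 41, 42,
43, 64, 45, 46, 47, 64, 49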
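
One plausible reading of this output, offered as an assumption rather than a confirmed diagnosis: the kernel sums the four channels in a float, and a 32-bit float has only 24 significant bits, so the lowest bits of a packed 32-bit value get rounded away. The sketch below is plain CPU C++ (no C++ AMP). It assumes the test bytes are simply 0, 1, 2, ... and that byte 0 is the low byte of each word; `round_trip_through_float` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Pack four consecutive bytes into one 32-bit word (byte 0 is the low byte).
std::uint32_t pack(const std::uint8_t* p) {
    return static_cast<std::uint32_t>(p[0]) |
           (static_cast<std::uint32_t>(p[1]) << 8) |
           (static_cast<std::uint32_t>(p[2]) << 16) |
           (static_cast<std::uint32_t>(p[3]) << 24);
}

// Fill a buffer with 0, 1, 2, ..., send every 32-bit word through a
// 32-bit float and back, and return the resulting bytes.
std::vector<int> round_trip_through_float(int byte_count) {
    assert(byte_count % 4 == 0 && byte_count <= 256);
    std::vector<std::uint8_t> bytes(byte_count);
    for (int i = 0; i < byte_count; ++i) {
        bytes[i] = static_cast<std::uint8_t>(i);
    }
    std::vector<int> out;
    for (int i = 0; i < byte_count; i += 4) {
        // Only 24 significant bits survive this conversion.
        const float as_float = static_cast<float>(pack(&bytes[i]));
        const std::uint32_t back = static_cast<std::uint32_t>(as_float);
        for (int shift = 0; shift < 32; shift += 8) {
            out.push_back(static_cast<int>((back >> shift) & 0xFFu));
        }
    }
    return out;
}
```

Under those assumptions, `round_trip_through_float(52)` begins 0, 1, 2, 3, 0, 5, 6, 7, 0, 9, 10, 11, 16, 13, ..., matching the values listed above: bytes 1 to 3 of each word survive, while byte 0 is rounded to a multiple of 8, 16, 32, or 64 as the word grows.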
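Questions:

Why is & 0xFF a problem?
Why does byte 0 of each unsigned int get assigned a strange value?
I guess I cannot create an array_view of bytes and have to use int or float?
Commenting out .synchronize changed nothing in the end. Why?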
Hmm.. so, after a lot of trial and error, answering my own question:

& 0xFF works fine, >> and ...
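• I guess I cannot create an array_view of bytes and have to use int or float
You cannot create an array or array_view of bytes. C++ AMP only supports a limited subset of C++ types. You can use a texture instead of an array_view. For image processing this has several advantages, most notably that packing and unpacking are much faster because they are implemented by the GPU's hardware. See the full example below.
• Commenting out .synchronize changed nothing in the end. Why?
You do not need to call dst.synchronize(), because the array_view is synchronized back to CPU memory implicitly when dst goes out of scope at the end of the function. By the way, you should not call dst.discard_data() at the start of the function, because doing so means that the data in a will not be copied to the GPU.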
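Here is an implementation that uses textures. Notes:

Using a texture of uint_4 gives you packing and unpacking of your data for free.
Using clamp() is better than using if clauses; for one thing, it uses an intrinsic function that the hardware is optimized for. In general, branching inside a kernel is bad because it stalls all the threads in the warp, even those that evaluate the condition to false.
You need two textures because, unlike arrays, they do not support aliasing.
I have removed some of the temporary variables. Variables use register space, which is very limited on the GPU. You should minimize your use of it so that all your threads can execute without waiting for register space to become available.
Explicit casts using static_cast mean fewer compiler warnings and are generally considered good (modern) C++ style.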
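
To check GPU results against something simple, the per-channel math from the notes above can be sketched as a CPU reference in standard C++. This is an assumption-laden sketch, not part of the answer: `adjust_channel` and `adjust_buffer` are hypothetical names, and `std::min`/`std::max` stand in for `direct3d::clamp`, which is only available in C++ AMP kernels.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Apply brightness and contrast to one 8-bit channel, clamping to [0, 255].
std::uint8_t adjust_channel(std::uint8_t value, float brightness, float contrast) {
    float tmp = (static_cast<float>(value) - 128.0f) * contrast + brightness + 128.0f;
    // Clamp with min/max instead of a chain of if statements.
    tmp = std::min(std::max(tmp, 0.0f), 255.0f);
    return static_cast<std::uint8_t>(tmp);
}

// Apply the same adjustment to every byte of a buffer, in place.
// On the CPU the bytes are addressable directly, so no packing is needed.
void adjust_buffer(std::uint8_t* data, std::size_t len, float brightness, float contrast) {
    assert(data != nullptr || len == 0);
    for (std::size_t i = 0; i < len; ++i) {
        data[i] = adjust_channel(data[i], brightness, contrast);
    }
}
```

For example, with a brightness of 10 and a contrast of 1, the bytes 0, 128, 255, 100 become 10, 138, 255, 110; the third value is clamped at 255.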

And the code:
void AMPChangeBrightnessContrastWrapper3(const byte* a, const int len, 
    const float brightness, const float contrast)
{
    const int pixel_len = len / 4;
    graphics::texture<graphics::uint_4, 1> inputTx(pixel_len, a, len, 8u);
    graphics::texture<graphics::uint_4, 1> outputTx(pixel_len, 8u);
    graphics::writeonly_texture_view<graphics::uint_4, 1> outputTxVw(outputTx);

    parallel_for_each( outputTxVw.extent, [=, &inputTx, &outputTx](index<1> idx) 
        restrict(amp) 
    { 
        const graphics::uint_4 v = inputTx[idx];

        float tmp = static_cast<float>(v.r);
        tmp = (tmp - 128) * contrast + brightness + 128;
        tmp = direct3d::clamp(tmp, 0.0f, 255.0f);
        const unsigned int temp1_ = static_cast<unsigned int>(tmp);

        tmp = static_cast<float>(v.g);
        tmp = (tmp - 128) * contrast + brightness + 128;
        tmp = direct3d::clamp(tmp, 0.0f, 255.0f);
        const unsigned int temp2_ = static_cast<unsigned int>(tmp);

        tmp = static_cast<float>(v.b);
        tmp = (tmp - 128) * contrast + brightness + 128;
        tmp = direct3d::clamp(tmp, 0.0f, 255.0f);
        const unsigned int temp3_ = static_cast<unsigned int>(tmp);

        tmp = static_cast<float>(v.a);
        tmp = (tmp - 128) * contrast + brightness + 128;
        tmp = direct3d::clamp(tmp, 0.0f, 255.0f);
        const unsigned int temp4_ = static_cast<unsigned int>(tmp);        

        outputTxVw.set(idx, graphics::uint_4(temp1_, temp2_, temp3_, temp4_));
    });
    copy(outputTx, (void*)a, len);
}

void AMPChangeBrightnessContrastWrapper
    (byte* a, int len, float brightness, float contrast)
{
    array_view<unsigned int> dst(len/4, (unsigned int*)a);
    parallel_for_each(dst.extent, [=](index<1> idx) restrict(amp) 
    {
        float temp1 = dst[idx] & 0xFF;
        temp1 = (temp1 - 128) * contrast + brightness + 128;
        if (temp1 < 0)
            temp1 = 0;
        if (temp1 > 255)
            temp1 = 255;

        float temp2 = (dst[idx] >> 8) & 0xFF;
        temp2 = (temp2 - 128) * contrast + brightness + 128;
        if (temp2 < 0)
            temp2 = 0;
        if (temp2 > 255)
            temp2 = 255;

        float temp3 = (dst[idx] >> 16) & 0xFF;
        temp3 = (temp3 - 128) * contrast + brightness + 128;
        if (temp3 < 0)
            temp3 = 0;
        if (temp3 > 255)
            temp3 = 255;

        float temp4 = (dst[idx] >> 24);
        temp4 = (temp4 - 128) * contrast + brightness + 128;
        if (temp4 < 0)
            temp4 = 0;
        if (temp4 > 255)
            temp4 = 255;

        unsigned int temp1_ = (unsigned int)temp1;
        unsigned int temp2_ = (unsigned int)temp2;
        unsigned int temp3_ = (unsigned int)temp3;
        unsigned int temp4_ = (unsigned int)temp4;
        unsigned int res = temp1_ + (temp2_ << 8) + (temp3_ << 16) + (temp4_ << 24);
        dst[idx] = res;
    });
    dst.synchronize();
}
