C++: SSE optimization of a Gaussian blur


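I'm working on a school project in which I have to optimize part of the code with SSE, but I have been stuck on one part of it for a few days.

I can't see any smart way to use SSE vector instructions (inline assembler or intrinsics) in this code, which is part of a Gaussian blur algorithm. I would be glad if somebody could give me a small hint.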

    for (int x = x_start; x < x_end; ++x)     // vertical blur...
    {
        float sum = image[x + (y_start - radius - 1)*image_w];
        float dif = -sum;

        for (int y = y_start - 2*radius - 1; y < y_end; ++y)
        {                                                   // inner vertical Radius loop           
            float p = (float)image[x + (y + radius)*image_w];   // next pixel
            buffer[y + radius] = p;                         // buffer pixel
            sum += dif + fRadius*p;
            dif += p;                                       // accumulate pixel blur

            if (y >= y_start)
            {
                float s = 0, w = 0;                         // border blur correction
                sum -= buffer[y - radius - 1]*fRadius;      // addition for fraction blur
                dif += buffer[y - radius] - 2*buffer[y];    // sum up differences: +1, -2, +1

                // cut off accumulated blur area of pixel beyond the border
                // assume: added pixel values beyond border = value at border
                p = (float)(radius - y);                   // top part to cut off
                if (p > 0)
                {
                    p = p*(p-1)/2 + fRadius*p;
                    s += buffer[0]*p;
                    w += p;
                }
                p = (float)(y + radius - image_h + 1);               // bottom part to cut off
                if (p > 0)
                {
                    p = p*(p-1)/2 + fRadius*p;
                    s += buffer[image_h - 1]*p;
                    w += p;
                }
                new_image[x + y*image_w] = (unsigned char)((sum - s)/(weight - w)); // set blurred pixel
            }
            else if (y + radius >= y_start)
            {
                dif -= 2*buffer[y];
            }
        } // for y
    } // for x


Another feature you can use is logical operations and masks.

For example, instead of:

  // process only 1
if (p > 0)
    p = p*(p-1)/2 + fRadius*p;

you can write:

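  // processes 4 floats at once; p and fRadius are __m128 (needs <xmmintrin.h>)
const __m128 zero  = _mm_setzero_ps();
const __m128 mask  = _mm_cmpgt_ps(p, zero);   // all-ones in lanes where p > 0
const __m128 p_tmp = _mm_add_ps(              // p*(p-1)/2 + fRadius*p
    _mm_mul_ps(_mm_mul_ps(p, _mm_sub_ps(p, _mm_set1_ps(1.0f))), _mm_set1_ps(0.5f)),
    _mm_mul_ps(fRadius, p));
p = _mm_or_ps(_mm_and_ps(mask, p_tmp), _mm_andnot_ps(mask, p)); // = (p_tmp & mask) | (p & ~mask)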
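In the loop from the question, a lane whose condition is false should add nothing to s and w, so there the mask can simply zero the weight instead of keeping the old p. A minimal sketch of that variant, assuming plain SSE and float data; the function names cut_weight_scalar and cut_weight_sse are made up for illustration:

```cpp
#include <xmmintrin.h>  // SSE intrinsics

// Scalar reference: weight of the kernel part cut off beyond a border,
// zero when nothing is cut off (p <= 0).
inline float cut_weight_scalar(float p, float fRadius) {
    return (p > 0.0f) ? p * (p - 1.0f) / 2.0f + fRadius * p : 0.0f;
}

// Branch-free version for 4 lanes (hypothetical helper, not from the question).
inline __m128 cut_weight_sse(__m128 p, __m128 fRadius) {
    const __m128 mask = _mm_cmpgt_ps(p, _mm_setzero_ps());  // all-ones where p > 0
    const __m128 tri  = _mm_mul_ps(_mm_mul_ps(p, _mm_sub_ps(p, _mm_set1_ps(1.0f))),
                                   _mm_set1_ps(0.5f));      // p*(p-1)/2
    const __m128 wgt  = _mm_add_ps(tri, _mm_mul_ps(fRadius, p));
    return _mm_and_ps(mask, wgt);  // lanes with p <= 0 become +0.0f
}
```

The result can then be used without any branch: s += border_value * weight and w += weight, since masked-out lanes contribute exactly zero.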

I would also suggest using a dedicated library that overloads the arithmetic operators for the SSE types. For example:

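The dif variable makes the iterations of the inner loop depend on each other, so you should try to parallelize the outer loop instead. Without operator overloading, though, the code will become unmanageable.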
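Roughly what parallelizing the outer loop means here: keep sum and dif as vectors holding the state of four adjacent columns, so the dependency along y stays sequential while four values of x advance together. The sketch below covers only the two running accumulators, not the border correction, and assumes float input and a width that is a multiple of 4; the function names are made up for illustration:

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Scalar reference: the recurrence from the question, one column at a time.
void accumulate_columns_scalar(const float* src, float* dst,
                               int w, int h, float fRadius) {
    for (int x = 0; x < w; ++x) {
        float sum = 0.0f, dif = 0.0f;
        for (int y = 0; y < h; ++y) {
            const float p = src[x + y * w];
            sum += dif + fRadius * p;  // needs dif from the previous row
            dif += p;
            dst[x + y * w] = sum;
        }
    }
}

// Same recurrence, four adjacent columns per iteration of the outer loop.
void accumulate_columns_sse(const float* src, float* dst,
                            int w, int h, float fRadius) {
    assert(w % 4 == 0);  // assumption of this sketch: no remainder columns
    const __m128 r = _mm_set1_ps(fRadius);
    for (int x = 0; x < w; x += 4) {
        __m128 sum = _mm_setzero_ps();
        __m128 dif = _mm_setzero_ps();
        for (int y = 0; y < h; ++y) {
            const __m128 p = _mm_loadu_ps(src + x + y * w);  // 4 pixels of row y
            sum = _mm_add_ps(sum, _mm_add_ps(dif, _mm_mul_ps(r, p)));
            dif = _mm_add_ps(dif, p);
            _mm_storeu_ps(dst + x + y * w, sum);
        }
    }
}
```

For the vertical pass the border terms depend only on y, so they are the same in all four lanes and can be computed once per row and broadcast with _mm_set1_ps.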

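Also consider rethinking the whole algorithm. The current one does not look parallelizable. Perhaps you can sacrifice some precision, or accept a slightly longer scalar run time.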

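You learn SSE at school? That's cool.

Yes :), it is an elective course on advanced assembler, but the deadline is approaching and I have been stuck on this problem for a long time :/

Unfortunately, if you want to use SSE I think you will have to reimplement it completely. You should precompute a 1D kernel of coefficients and then use SSE to perform the convolution along each axis.

I hope I understood you correctly, mate. I have already tried that and it works, but the problem is in sum += dif + fRadius*p, where I need dif from the previous iteration, which I can't get when I compute 4 iterations at once.

@user2174310, I see. You should try to parallelize the outer loop. But without operator overloading the code will become unmanageable.

I then used outer-loop parallelization and the speed-up is 1.4.7x (approximately). Thanks, everyone, for the advice.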
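For reference, the alternative suggested in the comments (precompute a 1D kernel, then convolve along each axis with SSE) could look roughly like this. It is a sketch under assumptions, not the code anyone in the thread used: float images, a width that is a multiple of 4, rows clamped at the borders, and hypothetical names make_gaussian_kernel and convolve_vertical_sse:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>
#include <xmmintrin.h>  // SSE intrinsics

// Normalized 1D Gaussian coefficients for offsets -radius..+radius.
std::vector<float> make_gaussian_kernel(int radius, float sigma) {
    std::vector<float> k(2 * radius + 1);
    float total = 0.0f;
    for (int i = -radius; i <= radius; ++i) {
        const float v = std::exp(-float(i * i) / (2.0f * sigma * sigma));
        k[i + radius] = v;
        total += v;
    }
    for (float& v : k) v /= total;  // weights sum to 1
    return k;
}

// Vertical pass: each output row is a weighted sum of nearby input rows.
// Four adjacent columns are handled per iteration; rows outside the image
// are clamped to the nearest border row.
void convolve_vertical_sse(const float* src, float* dst, int w, int h,
                           const std::vector<float>& k) {
    assert(w % 4 == 0);  // assumption of this sketch: no remainder columns
    const int radius = static_cast<int>(k.size()) / 2;
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; x += 4) {
            __m128 acc = _mm_setzero_ps();
            for (int i = -radius; i <= radius; ++i) {
                const int yy = std::min(std::max(y + i, 0), h - 1);
                const __m128 px = _mm_loadu_ps(src + x + yy * w);
                acc = _mm_add_ps(acc, _mm_mul_ps(_mm_set1_ps(k[i + radius]), px));
            }
            _mm_storeu_ps(dst + x + y * w, acc);
        }
    }
}
```

Unlike the running-sum scheme in the question, this has no dependency between consecutive rows, at the price of 2*radius + 1 multiply-adds per pixel and pass.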