CUDA: Optimizing a very simple image processing kernel

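I'm hoping someone can give me a hand here. I have been getting my feet wet in CUDA and wrote a simple kernel to negate an image. It works brilliantly and I am quite happy with it.

I guess my rather silly question is: is there any way I could optimize this kernel? I tried to use shared memory, but the number of pixels is 19224000.

I tried just doing `__shared__` [...]. As a CUDA programmer could probably tell, I'm a bit lost here.

Here is my kernel: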

__global__ void cuda_negate_image(int * new_array, int * old_array, int rows, int cols){

    int tIdx = threadIdx.x;
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    int n = rows * cols;

    if (i < n)
        new_array[i] = -(old_array[i]) + 255;

}

Any help would be great.

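There isn't much room for optimization here. For simple, memory-bound operations, the four golden rules are usually:

  • Coalesce memory reads and writes
  • Maximize the number of bytes per memory transaction when using coalesced memory access
  • Use the appropriate compiler heuristics to ensure the emitted code is optimal
  • Where feasible, amortize thread scheduling and setup overhead by having each thread process more than one input. (Note that this requires a different approach to choosing the execution grid parameters: size the grid to fill the device rather than to match the total amount of available work.)

Applying these principles to your kernel, I get something like this: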

    __device__ __forceinline__ void negate(int &in, int &out)
    {
       out = 255 - in;
    }
    __device__ __forceinline__ void negate(int2 &in, int2 & out)
    {
       negate(in.x, out.x);
       negate(in.y, out.y);
    }
    __device__ __forceinline__ void negate(int4 &in, int4 & out)
    {
       negate(in.x, out.x);
       negate(in.y, out.y);
       negate(in.z, out.z);
       negate(in.w, out.w);
    }
    template<typename T>
    __global__ void cuda_negate_image(T * __restrict__ new_array, T * __restrict__ old_array, int n)
    {
    
       int i = blockDim.x * blockIdx.x + threadIdx.x;
       int stride = blockDim.x * gridDim.x;
    
       T oldval, newval;
       for(; i < n; i += stride) {
          oldval = old_array[i];
          negate(oldval, newval);
          new_array[i] = newval;
       }
    }
    
    template __global__ void cuda_negate_image<int>(int * __restrict__ new_array, int * __restrict__ old_array, int n);
    template __global__ void cuda_negate_image<int2>(int2 * __restrict__ new_array, int2 * __restrict__ old_array, int n);
    template __global__ void cuda_negate_image<int4>(int4 * __restrict__ new_array, int4 * __restrict__ old_array, int n);
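
The grid-stride loop above only pays off if the launch is sized for the device rather than for the pixel count. The following is a minimal host-side sketch of that idea, not the answerer's code: the names `negate_int4`, `negate_tail` and `negate_on_device`, the block size of 256 and the factor of 8 blocks per multiprocessor are all assumptions made for illustration, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Grid-stride kernel over int4 elements: four pixels per 128-bit load/store.
__global__ void negate_int4(int4 *__restrict__ out, const int4 *__restrict__ in, int n4)
{
    int stride = blockDim.x * gridDim.x;
    for (int i = blockDim.x * blockIdx.x + threadIdx.x; i < n4; i += stride) {
        int4 v = in[i];
        out[i] = make_int4(255 - v.x, 255 - v.y, 255 - v.z, 255 - v.w);
    }
}

// Scalar kernel for the (n % 4) pixels that do not fill a whole int4.
__global__ void negate_tail(int *out, const int *in, int begin, int n)
{
    int i = begin + blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) out[i] = 255 - in[i];
}

// Hypothetical helper: negates `pixels` on the current device and returns the result.
std::vector<int> negate_on_device(const std::vector<int> &pixels)
{
    const int n = static_cast<int>(pixels.size());
    std::vector<int> result(n);
    if (n == 0) return result;
    const size_t bytes = n * sizeof(int);

    // cudaMalloc returns memory aligned to at least 256 bytes, so the
    // reinterpret_cast to int4* below is safe for these base pointers.
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, pixels.data(), bytes, cudaMemcpyHostToDevice);

    // Size the grid from the device, not from n.
    int device = 0, sm_count = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&sm_count, cudaDevAttrMultiProcessorCount, device);
    const int threads = 256;          // assumed block size
    const int blocks = 8 * sm_count;  // assumed multiplier; tune by benchmarking

    const int n4 = n / 4;
    if (n4 > 0)
        negate_int4<<<blocks, threads>>>(reinterpret_cast<int4 *>(d_out),
                                         reinterpret_cast<const int4 *>(d_in), n4);
    const int tail = n - 4 * n4;      // 0 to 3 leftover pixels
    if (tail > 0)
        negate_tail<<<1, tail>>>(d_out, d_in, 4 * n4, n);

    // cudaMemcpy on the default stream waits for both kernels to finish.
    cudaMemcpy(result.data(), d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

For the 19224000-pixel image in the question, `n` is divisible by 4, so the tail kernel would not be launched at all.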
    

Only benchmarking on your target hardware will tell you which version of the code is fastest, and whether any of this is worth the effort.
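
One common way to run such a benchmark is with CUDA events. The helper below is a sketch under stated assumptions: `average_ms` is a hypothetical name, the work is assumed to be enqueued on the default stream, and error checking is left out.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: returns the average time in milliseconds of one call
// to `launch`, which should enqueue GPU work on the default stream.
template <typename Launch>
float average_ms(Launch launch, int reps)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    launch();                     // warm-up run, excluded from the timing
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r)
        launch();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait until all timed work has finished

    float total_ms = 0.0f;
    cudaEventElapsedTime(&total_ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return total_ms / reps;
}
```

It could be used as `average_ms([&] { cuda_negate_image<int4><<<blocks, threads>>>(d_out4, d_in4, n / 4); }, 20)`, where `blocks`, `threads`, `d_out4` and `d_in4` stand for your own launch parameters and device pointers. Since the kernel reads and writes each pixel once, the effective bandwidth is roughly `2 * n * sizeof(int)` bytes divided by the measured time, which can be compared against the device's peak memory bandwidth.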

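Precompute the product `rows * cols` and pass it to the kernel. Each thread could also process more than one pixel. The effect will be negligible, though, because your kernel is limited by memory throughput.

Indeed, that is a very simple thing I missed. Whoops. Could I use shared memory or coalesced memory in my kernel?

Precomputing n will give you almost nothing. To optimize memory-bandwidth-bound code, you want to see 128-bit transactions being performed (i.e. `int4`), multiple transactions per thread, and, if permitted, also use [...].

Thanks for taking the time to give me a hand! I will definitely dive in here and learn as much as I can. Cheers.

May I ask how you generated the code?

@GregKasapidis: I'm not sure I understand what you are asking. I wrote the code by hand in an editor. There was no automatic code generation tool, if that is what you mean.

@Talonmes Yes, that is what I had in mind, since you defined every negate variant in there. I still find it a bit odd, because the functions share the same name. I guess the compiler determines which one to use based on the arguments. Thanks for helping me sort this out.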
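
The guess in the last comment is correct: this is ordinary C++ function overloading, and the compiler selects the `negate` overload at compile time from the argument types. Inside the templated kernel, the choice is made when the template is instantiated for a given `T`. A small sketch follows; the `__host__ __device__` qualifiers and the wrappers `negate_scalar` and `negate_pair` are assumptions added so that the overloads can be exercised on the host.

```cuda
#include <cuda_runtime.h>

// Two overloads with the same name; the argument types decide which is called.
__host__ __device__ inline void negate(const int &in, int &out)
{
    out = 255 - in;
}

__host__ __device__ inline void negate(const int2 &in, int2 &out)
{
    negate(in.x, out.x);   // resolves to the int overload
    negate(in.y, out.y);
}

// Hypothetical host-side wrappers used only to demonstrate the resolution.
int negate_scalar(int v)
{
    int r = 0;
    negate(v, r);          // int arguments: first overload
    return r;
}

int2 negate_pair(int x, int y)
{
    int2 in = make_int2(x, y);
    int2 out = make_int2(0, 0);
    negate(in, out);       // int2 arguments: second overload
    return out;
}
```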