CUDA memory requirements

I recently wrote a very simple kernel:

__device__ uchar elem(const Matrix m, int row, int col) {
    if(row == -1) {
        row = 0;
    } else if(row > m.rows-1) {
        row = m.rows-1;
    }

    if(col == -1) {
        col = 0;
    } else if(col > m.cols-1) {
        col = m.cols-1;
    }
    return *((uchar*)(m.data + row*m.step + col));
}

/**
* Each thread will calculate the value of one pixel of the image 'res'
*/
__global__ void resizeKernel(const Matrix img, Matrix res) {
    int row = threadIdx.y + blockIdx.y * blockDim.y;
    int col = threadIdx.x + blockIdx.x * blockDim.x;

    if(row < res.rows && col < res.cols) {
        uchar* e = res.data + row * res.step + col;

        *e = (elem(img, 2*row, 2*col) >> 2) +
             ((elem(img, 2*row, 2*col-1) + elem(img, 2*row, 2*col+1) 
             + elem(img, 2*row-1, 2*col) + elem(img, 2*row+1, 2*col)) >> 3) +
             ((elem(img, 2*row-1, 2*col-1) + elem(img, 2*row+1, 2*col+1)
             + elem(img, 2*row+1, 2*col-1) + elem(img, 2*row-1, 2*col+1)) >> 4);
    }
}

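Basically, it computes the pixel values of a reduced image from the values of a larger image. The computation is the expression inside the "if" in resizeKernel.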

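My first tests did not work correctly. To find out what was going on, I started commenting out some of the lines in that expression. Once I had reduced the number of operations, it started working.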

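My theory at that point was that it could be related to the memory available for storing the intermediate results of the expression. Sure enough, when I reduced the number of threads per block, it started working perfectly, with no need to reduce the number of operations.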

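Based on this experience, I would like to know how to better estimate the number of threads per block so that the memory requirement does not exceed what is available. How can I know how much memory the operations above need? (And while we are at it, what kind of memory is it: cache, shared memory, etc.?)

Thanks.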

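It is mostly registers. You can find out the per-thread register consumption by adding the option

-Xptxas="-v"

to the nvcc invocation that compiles the kernel. The assembler will then report the number of registers per thread, and the static shared memory, local memory and constant memory used by the compiled code.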
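The same figures can also be read at run time. The following is a minimal sketch, assuming the CUDA runtime API function cudaFuncGetAttributes; dummyKernel is a hypothetical stand-in, not the kernel from the question, and should be replaced by the kernel you want to inspect.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute the kernel you want to inspect.
__global__ void dummyKernel(const unsigned char* in, unsigned char* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        out[i] = in[i] >> 1;
    }
}

int main() {
    // Ask the runtime for the resources the compiled kernel needs.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, dummyKernel);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("registers per thread    : %d\n", attr.numRegs);
    std::printf("static shared memory    : %zu bytes\n", attr.sharedSizeBytes);
    std::printf("local memory per thread : %zu bytes\n", attr.localSizeBytes);
    std::printf("constant memory         : %zu bytes\n", attr.constSizeBytes);
    // Largest block this kernel can be launched with on the current device.
    std::printf("max threads per block   : %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```

The maxThreadsPerBlock field already accounts for the kernel's register use, so a launch with a larger block than that value is expected to fail.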

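NVIDIA provides an occupancy calculator spreadsheet into which you can plug the assembler's output to see the feasible range of block sizes and their effect on GPU occupancy. Chapter 3 of the CUDA Programming Guide also discusses in detail the concept of occupancy and how block size and kernel resource requirements interact.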
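The interaction between register use and block size can be sketched in code as well. This is only an illustration under stated assumptions: dummyKernel and registerLimitedThreads are hypothetical names introduced here, the register bound ignores the hardware's allocation granularity, and cudaOccupancyMaxPotentialBlockSize is only available in newer toolkits (CUDA 6.5 and later).

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; substitute your own.
__global__ void dummyKernel(const unsigned char* in, unsigned char* out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        out[i] = in[i] >> 1;
    }
}

// Rough upper bound on threads per block imposed by registers alone,
// rounded down to whole warps. Ignores register allocation granularity.
int registerLimitedThreads(int regsPerThread, int regsPerBlock, int warpSize) {
    if (regsPerThread <= 0 || warpSize <= 0) return 0;
    int threads = regsPerBlock / regsPerThread;
    return (threads / warpSize) * warpSize;
}

int main() {
    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess ||
        cudaFuncGetAttributes(&attr, dummyKernel) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    std::printf("register-limited threads per block: %d\n",
                registerLimitedThreads(attr.numRegs, prop.regsPerBlock, prop.warpSize));

    // Let the runtime suggest a block size that maximizes occupancy.
    int minGridSize = 0, blockSize = 0;
    if (cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                           dummyKernel, 0, 0) != cudaSuccess) {
        std::fprintf(stderr, "occupancy query failed\n");
        return 1;
    }
    std::printf("suggested block size: %d\n", blockSize);

    const int n = 1 << 20;
    unsigned char *in = nullptr, *out = nullptr;
    if (cudaMalloc(&in, n) != cudaSuccess || cudaMalloc(&out, n) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(in, 0, n);

    int grid = (n + blockSize - 1) / blockSize;
    dummyKernel<<<grid, blockSize>>>(in, out, n);

    // A launch that requests too many resources is reported here,
    // not by a crash, so always check after launching.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

As an illustrative calculation (the numbers are made up, not measured): with 8192 registers per block and 20 registers per thread, registerLimitedThreads returns 384, so a 512-thread block would not launch while a 256-thread block would. When a launch exceeds a limit like this, the kernel simply does not run, and the failure only becomes visible if the error code is checked after the launch.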