Dynamic allocation on the device causes memory copy to fail


I am using the CUDA driver API. A simplified description of the problem follows:

// .cu file, compiled to a ptx file

extern "C" __global__ void SomeFunction(char* d_buffer) {
    // Allocate memory per thread (about 5x10^5 threads in total)
    float* p = (float*)malloc(sizeof(float) * 100);
    // ... do some calculation with the allocated memory ...
    // ... do some other calculation with d_buffer ...
    free(p);
}

// .cpp file

int main()
{
    // Allocate device buffer
    CUdeviceptr d_buffer;
    cuMemAlloc(&d_buffer, bytes);
    // Allocate host buffer
    char* h_buffer = new char[bytes];
    // Copy host buffer to device buffer (destination first, then source)
    cuMemcpyHtoD(d_buffer, h_buffer, bytes);

    CUfunction func;
    cuModuleGetFunction(&func, module, "SomeFunction");
    cuLaunchKernel(func, grid_dims,...,block_dims,...,args,...);
    // Copy device buffer to host buffer (destination first, then source)
    cuMemcpyDtoH(h_buffer, d_buffer, bytes); // Failed!
}

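The problem is that the copy on the last line of the .cpp file fails. However, if I comment out the dynamic allocation (malloc, free) in the .cu file, the copy succeeds. My question is: are there any limitations on using dynamic allocation with the driver API? If so, what are they? And how do I use dynamic allocation correctly with the driver API?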

My question is: are there any limitations on using dynamic allocation with the driver API?

None beyond those that apply in the runtime API.

How do I use dynamic allocation correctly with the driver API?

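The important thing to recognise is that the copy after the kernel fails because the kernel itself hit an error at runtime. The failing copy is only where that earlier error gets reported.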

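As described in the CUDA documentation, in-kernel allocations come from a fixed-size heap, which defaults to 8 MB. If you exhaust that heap, the malloc call in the kernel fails and returns NULL. This is a condition you can test for. I am guessing you don't, so your "do some calculation with allocated memory" step dereferences a null pointer and crashes the kernel.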
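A minimal sketch of testing for that condition, not the asker's actual kernel. The extra `d_error_flag` argument is hypothetical: it is assumed to point at a single int in device memory that the host zeroes before the launch and reads back afterwards. The `__global__` shim is only there so the sketch also compiles as plain C++ for a host-side dry run; under nvcc it has no effect.

```cpp
#include <cstdlib>

#ifndef __CUDACC__
#define __global__  // shim: lets the sketch compile as plain C++ outside nvcc
#endif

// Hypothetical variant of SomeFunction with an assumed extra argument,
// d_error_flag, that is set to 1 if any thread fails to allocate.
extern "C" __global__ void SomeFunction(char* d_buffer, int* d_error_flag) {
    // Per-thread allocation from the device malloc heap.
    float* p = static_cast<float*>(malloc(sizeof(float) * 100));
    if (p == nullptr) {
        // Heap exhausted: record the failure and return early instead of
        // dereferencing a null pointer. Every failing thread writes the
        // same value, so the concurrent writes are harmless here.
        *d_error_flag = 1;
        return;
    }

    // ... calculation with p and d_buffer, as in the question ...
    p[0] = 0.0f;

    free(p);
}
```

With a check like this, an exhausted heap shows up as a non-zero flag on the host rather than as a failed copy after the launch.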

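To correct this in the driver API, you need to call cuCtxSetLimit with the CU_LIMIT_MALLOC_HEAP_SIZE parameter and set the heap size to something more realistic (think of the maximum number of resident threads on your device x bytes per thread, rounded up to the nearest 16-byte page alignment, plus a safety margin). If you do this, things will probably start working.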
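The sizing rule above can be sketched as a small host-side calculation. The helper names `round_up` and `estimate_heap_bytes` are made up for illustration, and the resident-thread count and safety margin in the comment are assumed values, not figures from the question. The only driver API call involved is `cuCtxSetLimit`, which appears in a comment so the arithmetic can be compiled without the CUDA headers.

```cpp
#include <cstddef>

// Round n up to the next multiple of alignment (alignment must be > 0).
constexpr std::size_t round_up(std::size_t n, std::size_t alignment) {
    return (n + alignment - 1) / alignment * alignment;
}

// Hypothetical helper: resident threads x per-thread bytes (rounded up to
// 16 bytes), plus a safety margin chosen by the caller.
constexpr std::size_t estimate_heap_bytes(std::size_t max_resident_threads,
                                          std::size_t bytes_per_thread,
                                          std::size_t safety_margin_bytes) {
    return max_resident_threads * round_up(bytes_per_thread, 16)
           + safety_margin_bytes;
}

// Example with assumed numbers: 163840 resident threads, 100 floats
// (400 bytes) per thread, and a 16 MiB margin.
//
//   std::size_t heap_bytes =
//       estimate_heap_bytes(163840, sizeof(float) * 100, 16u << 20);
//   // Set the limit before launching any kernel that calls malloc:
//   cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, heap_bytes);
```

The real number of resident threads depends on the device and on the kernel's occupancy, so the first argument should come from your own hardware rather than from this example.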