When I call an asynchronous CUDA kernel, how are its arguments copied?

asynchronous cuda

Suppose I want to invoke a CUDA kernel like this:

struct foo { int a; int b; float c; double d; };
foo arg;
// fill in elements of `arg` here
my_kernel<<<grid_size, block_size, 0, stream>>>(arg);

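Assume stream was previously created by a call to cudaStreamCreate(), so the launch above executes asynchronously. What I am concerned about is the required lifetime of arg.

When the kernel is invoked, are its arguments copied synchronously (so it is safe for arg to go out of scope immediately), or are they copied asynchronously (so I need to make sure arg stays alive until the kernel runs)?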

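The arguments of a kernel call are copied before execution, so scope is not a concern. Note, however, that the combined size of all kernel arguments cannot exceed a maximum size in bytes. If you need a larger structure or block of data, allocate memory on the device with cudaMalloc, copy the contents of the host structure to the device structure with cudaMemcpy, and call the kernel with a pointer to the new device structure.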

Your code would then look like this:

struct foo { int a; int b; float c; double d; };
foo arg;
foo *arg_d;
// fill in elements of `arg` here

cudaMalloc(&arg_d, sizeof(foo));
// check the allocation here
cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice);
my_kernel<<<grid_size, block_size, 0, stream>>>(arg_d);
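For completeness, here is a minimal, self-contained sketch of the pointer-based variant with error checking and cleanup. The kernel body, the out_d result buffer, the 1x1 launch configuration, and the helper name run_with_device_struct are all assumptions made for illustration; they are not part of the original snippet.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct foo { int a; int b; float c; double d; };

// Hypothetical kernel: reads the struct through a device pointer.
__global__ void my_kernel(const foo *arg, double *out)
{
    *out = arg->a + arg->b + arg->c + arg->d;
}

// Returns the sum computed on the device, or -1.0 on any CUDA error.
double run_with_device_struct()
{
    foo arg = {1, 2, 3.0f, 4.0};   // illustrative values
    foo *arg_d = nullptr;
    double *out_d = nullptr;
    double out_h = -1.0;
    cudaStream_t stream;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1.0;

    // Allocate device storage and copy the host struct into it.
    bool ok = cudaMalloc(&arg_d, sizeof(foo)) == cudaSuccess
           && cudaMalloc(&out_d, sizeof(double)) == cudaSuccess
           && cudaMemcpy(arg_d, &arg, sizeof(foo), cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        my_kernel<<<1, 1, 0, stream>>>(arg_d, out_d);
        // Check the launch, wait for the kernel, then fetch the result.
        ok = cudaGetLastError() == cudaSuccess
          && cudaStreamSynchronize(stream) == cudaSuccess
          && cudaMemcpy(&out_h, out_d, sizeof(double), cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(arg_d);               // cudaFree on a null pointer is a no-op
    cudaFree(out_d);
    cudaStreamDestroy(stream);
    return ok ? out_h : -1.0;
}

int main()
{
    printf("sum computed on device = %f\n", run_with_device_struct());
    return 0;
}
```

With this approach the lifetime question moves from the host variable to the device allocation: only the pointer value is passed as a kernel argument, so arg_d must stay allocated until the kernel has finished, which is why the sketch frees it only after cudaStreamSynchronize.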

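The arguments are copied synchronously at launch. The API exposes a call stack onto which the execution parameters and the function arguments are pushed in order, followed by a call that finalizes them into a CUDA kernel launch on the driver's internal stream/command queue.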

This process is not documented, but as of CUDA 7.5, a runtime API kernel launch looks like this:

dot_product<<<1,n>>>(n, d_a, d_b);

where the host stub function dot_product expands to:

void __device_stub__Z11dot_productiPfS_(int __par0, float *__par1, float *__par2)
{
    if (cudaSetupArgument((void *)(char *)&__par0, sizeof(__par0), (size_t)0UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par1, sizeof(__par1), (size_t)8UL) != cudaSuccess) return;
    if (cudaSetupArgument((void *)(char *)&__par2, sizeof(__par2), (size_t)16UL) != cudaSuccess) return;
    {
        volatile static char *__f __attribute__((unused)); __f = ((char *)((void ( *)(int, float *, float *))dot_product)); 
        (void)cudaLaunch(((char *)((void ( *)(int, float *, float *))dot_product))); 
    };
}

void dot_product( int __cuda_0,float *__cuda_1,float *__cuda_2)
{
    __device_stub__Z11dot_productiPfS_( __cuda_0,__cuda_1,__cuda_2);
}

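cudaSetupArgument is the API call that pushes the arguments onto the call stack. Interestingly, it is actually marked as deprecated in the CUDA 7.5 API documentation, even though the compiler still uses it. I therefore expect this to change in the future, but the idea will stay the same.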
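One way to check this behavior empirically is to overwrite the host struct immediately after the launch and let it go out of scope before synchronizing. The following is only a sketch: the kernel body, the out_d result buffer, the field values, and the helper name launch_then_clobber are assumptions introduced for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

struct foo { int a; int b; float c; double d; };

// Hypothetical kernel: writes the sum of the fields it received by value.
__global__ void my_kernel(foo arg, double *out)
{
    *out = arg.a + arg.b + arg.c + arg.d;
}

// Launches with a block-local struct, overwrites it right after the launch,
// and returns what the kernel actually saw (or -1.0 on any CUDA error).
double launch_then_clobber()
{
    cudaStream_t stream;
    double *out_d = nullptr;
    double out_h = -1.0;
    if (cudaStreamCreate(&stream) != cudaSuccess) return -1.0;
    if (cudaMalloc(&out_d, sizeof(double)) != cudaSuccess) {
        cudaStreamDestroy(stream);
        return -1.0;
    }

    {
        foo arg = {1, 2, 3.0f, 4.0};
        my_kernel<<<1, 1, 0, stream>>>(arg, out_d);
        arg = foo{100, 200, 300.0f, 400.0};  // overwrite after the launch call returned
    }   // arg is out of scope here; the kernel may not have started yet

    bool ok = cudaGetLastError() == cudaSuccess
           && cudaStreamSynchronize(stream) == cudaSuccess
           && cudaMemcpy(&out_h, out_d, sizeof(double), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(out_d);
    cudaStreamDestroy(stream);
    return ok ? out_h : -1.0;
}

int main()
{
    // 10.0 means the kernel saw the values that were in arg at launch time.
    printf("kernel saw sum = %f\n", launch_then_clobber());
    return 0;
}
```

A result of 10.0 is consistent with the arguments having been captured when the launch call was made. A test like this can only support that conclusion for the toolkit and driver it runs on; it is not a substitute for a documented guarantee.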

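The kernel launch creates its own copy of arg, so it is fine if arg goes out of scope before the kernel actually starts executing.

Is this aspect of kernel launches documented anywhere? I have run into a problem in some of my code that I suspect could be caused by the arguments not being copied synchronously. I would like to check that in order to rule it out, but I cannot find an authoritative statement about it.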