CUDA: how should the numbers in square brackets be interpreted?


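The number shown in square brackets after the kernel name correlates to the CUDA API that launched that kernel.

In the output below, the numbers shown in square brackets after the kernel name are

  • 94,
  • 105,
  • 2191,
  • 2198.
So what exactly is CUDA API [94] (and the others)?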


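==27706== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.36 GFlop/s, Time= 3.707 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27706== Profiling result:
   Start  Duration  Grid Size  Block Size  Regs*    SSMem*  DSMem*      Size  Throughput           Device  Context  Stream  Name
133.81ms  135.78us          -           -      -         -       -  409.60KB  3.0167GB/s  GeForce GT 640M        1       2  [CUDA memcpy HtoD]
134.62ms  270.66us          -           -      -         -       -  819.20KB  3.0267GB/s  GeForce GT 640M        1       2  [CUDA memcpy HtoD]
134.90ms  3.7037ms  (20 10 1)   (32 32 1)     29  8.1920KB      0B         -           -  GeForce GT 640M        1       2  void matrixMulCUDA(float*, float*, float*, int, int) [94]
138.71ms  3.7011ms  (20 10 1)   (32 32 1)     29  8.1920KB      0B         -           -  GeForce GT 640M        1       2  void matrixMulCUDA(float*, float*, float*, int, int) [105]
1.24341s  3.7011ms  (20 10 1)   (32 32 1)     29  8.1920KB      0B         -           -  GeForce GT 640M        1       2  void matrixMulCUDA(float*, float*, float*, int, int) [2191]
1.24711s  3.7046ms  (20 10 1)   (32 32 1)     29  8.1920KB      0B         -           -  GeForce GT 640M        1       2  void matrixMulCUDA(float*, float*, float*, int, int) [2198]
1.25089s  248.13us          -           -      -         -       -  819.20KB  3.3015GB/s  GeForce GT 640M        1       2  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.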

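As it says:

The number shown in square brackets after the kernel name correlates to the CUDA API call that launched that kernel.

If you run a given code with the --print-api-trace option, you will get a sequential list of all the CUDA API calls issued by that application. If you number those calls in order, the number associated with a particular kernel launch is the one displayed in square brackets in the --print-gpu-trace output.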
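The numbering described above can be reproduced by hand. The following is only a rough sketch: the helper name number_api_calls is made up for illustration, and it assumes that every API record in the trace starts with a start time and a duration (such as 131.31ms 59.911ms) and that nvprof writes its report to stderr. It prints each API call prefixed with its 1-based position, so the position of a cudaLaunchKernel line can be compared with the bracketed number next to it.

```shell
# Hypothetical helper: number the API records of an nvprof API trace read from stdin.
# Assumption: a record line looks like "<start><unit>  <duration><unit>  <name ...>";
# banner lines ("==PID== ...") and the column header do not match and are skipped.
number_api_calls() {
  awk '
    $1 ~ /^[0-9.]+(ns|us|ms|s)$/ && $2 ~ /^[0-9.]+(ns|us|ms|s)$/ {
      n++                                   # 1-based position of this API call
      line = $0
      # Drop the start and duration columns, keep the call name and any annotation.
      sub(/^[ \t]*[^ \t]+[ \t]+[^ \t]+[ \t]+/, "", line)
      printf "%d\t%s\n", n, line
    }
  '
}

# Possible usage (assumes the report goes to stderr):
#   nvprof --print-api-trace ./t1 2>&1 | number_api_calls | grep cudaLaunchKernel
```

Whether the position always equals the bracketed number may depend on the tool version and on how the application issues its calls, so treat a mismatch as a reason to look closer rather than as an error.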

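Here is a fully worked example. Note the correlation of [105], [106], and [108] between the API trace output and the GPU trace output (call 107 is the intervening cudaDeviceSynchronize, which launches no kernel):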

$ cat t1.cu
__global__ void k(){}

int main(){

  k<<<1,1>>>();
  k<<<1,1>>>();
  cudaDeviceSynchronize();
  k<<<1,1>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t1 t1.cu
$ nvprof --print-api-trace ./t1
==7206== NVPROF is profiling process 7206, command: ./t1
==7206== Profiling application: ./t1
==7206== Profiling result:
   Start  Duration  Name
116.17ms  3.0990us  cuDeviceGetPCIBusId
130.20ms     800ns  cuDeviceGetCount
130.20ms     251ns  cuDeviceGetCount
130.41ms  1.0500us  cuDeviceGet
130.41ms     705ns  cuDeviceGetAttribute
130.42ms     539ns  cuDeviceGetAttribute
130.42ms     547ns  cuDeviceGetAttribute
130.46ms     525ns  cuDeviceGetCount
130.46ms     277ns  cuDeviceGet
130.46ms  59.680us  cuDeviceGetName
130.52ms  63.802us  cuDeviceTotalMem
130.59ms     497ns  cuDeviceGetAttribute
130.59ms     226ns  cuDeviceGetAttribute
130.59ms     282ns  cuDeviceGetAttribute
130.59ms     234ns  cuDeviceGetAttribute
130.59ms     229ns  cuDeviceGetAttribute
130.59ms  34.628us  cuDeviceGetAttribute
130.62ms     372ns  cuDeviceGetAttribute
130.63ms     220ns  cuDeviceGetAttribute
130.63ms     284ns  cuDeviceGetAttribute
130.63ms     237ns  cuDeviceGetAttribute
130.63ms     222ns  cuDeviceGetAttribute
130.63ms     231ns  cuDeviceGetAttribute
130.63ms     288ns  cuDeviceGetAttribute
130.63ms     219ns  cuDeviceGetAttribute
130.63ms  3.1870us  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     275ns  cuDeviceGetAttribute
130.63ms     211ns  cuDeviceGetAttribute
130.63ms     213ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     336ns  cuDeviceGetAttribute
130.64ms     210ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     210ns  cuDeviceGetAttribute
130.64ms     216ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     216ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     212ns  cuDeviceGetAttribute
130.64ms     214ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.64ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     213ns  cuDeviceGetAttribute
130.65ms     212ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     211ns  cuDeviceGetAttribute
130.65ms     210ns  cuDeviceGetAttribute
130.65ms     215ns  cuDeviceGetAttribute
130.65ms     212ns  cuDeviceGetAttribute
130.65ms  320.65us  cuDeviceGetAttribute
130.97ms     322ns  cuDeviceGetAttribute
130.97ms     206ns  cuDeviceGetAttribute
130.97ms     218ns  cuDeviceGetAttribute
130.97ms     212ns  cuDeviceGetAttribute
130.97ms     212ns  cuDeviceGetAttribute
130.98ms     226ns  cuDeviceGetAttribute
130.98ms     220ns  cuDeviceGetAttribute
130.98ms     212ns  cuDeviceGetAttribute
130.98ms     210ns  cuDeviceGetAttribute
130.98ms     206ns  cuDeviceGetAttribute
130.98ms     207ns  cuDeviceGetAttribute
130.98ms     209ns  cuDeviceGetAttribute
130.98ms     211ns  cuDeviceGetAttribute
130.98ms     208ns  cuDeviceGetAttribute
130.98ms     208ns  cuDeviceGetAttribute
130.98ms     229ns  cuDeviceGetAttribute
130.98ms     215ns  cuDeviceGetAttribute
130.98ms     216ns  cuDeviceGetAttribute
130.98ms     209ns  cuDeviceGetAttribute
130.98ms  316.59us  cuDeviceGetAttribute
131.30ms     266ns  cuDeviceGetAttribute
131.30ms     252ns  cuDeviceGetAttribute
131.30ms     212ns  cuDeviceGetAttribute
131.30ms     235ns  cuDeviceGetAttribute
131.30ms     209ns  cuDeviceGetAttribute
131.30ms     272ns  cuDeviceGetAttribute
131.30ms     207ns  cuDeviceGetAttribute
131.30ms     735ns  cuDeviceGetAttribute
131.30ms     254ns  cuDeviceGetAttribute
131.30ms     208ns  cuDeviceGetAttribute
131.30ms     208ns  cuDeviceGetAttribute
131.30ms     610ns  cuDeviceGetAttribute
131.31ms     273ns  cuDeviceGetAttribute
131.31ms     412ns  cuDeviceGetAttribute
131.31ms     216ns  cuDeviceGetAttribute
131.31ms     211ns  cuDeviceGetAttribute
131.31ms     205ns  cuDeviceGetAttribute
131.31ms  59.911ms  cudaLaunchKernel (k(void) [105])
191.23ms  11.222us  cudaLaunchKernel (k(void) [106])
191.24ms  5.7860us  cudaDeviceSynchronize
191.25ms  9.2890us  cudaLaunchKernel (k(void) [108])
191.26ms  5.1790us  cudaDeviceSynchronize
$ nvprof --print-gpu-trace ./t1
==7224== NVPROF is profiling process 7224, command: ./t1
==7224== Profiling application: ./t1
==7224== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*           Device   Context    Stream  Name
191.20ms  1.6000us              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [105]
191.22ms     896ns              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [106]
191.23ms     928ns              (1 1 1)         (1 1 1)         8        0B        0B  Quadro K2000 (0         1         7  k(void) [108]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$