CUDA: how to interpret the number in square brackets?
The nvprof documentation says: "The number shown in the square brackets after the kernel name correlates to the CUDA API that launched that kernel."
In my output, the numbers shown in square brackets after the kernel name are:
- 94
- 105
- 2191
- 2198
So what exactly is CUDA API [94] (and the others)?
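==27706== Profiling application: matrixMul
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GT 640M LE" with compute capability 3.0
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 35.36 GFlop/s, Time= 3.707 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK
Note: For peak performance, please refer to the matrixMulCUBLAS example.
==27706== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
133.81ms 135.78us - - - - - 409.60KB 3.0167GB/s GeForce GT 640M 1 2 [CUDA memcpy HtoD]
134.62ms 270.66us - - - - - 819.20KB 3.0267GB/s GeForce GT 640M 1 2 [CUDA memcpy HtoD]
134.90ms 3.7037ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA(float*, float*, float*, int, int) [94]
138.71ms 3.7011ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA(float*, float*, float*, int, int) [105]
1.24341s 3.7011ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA(float*, float*, float*, int, int) [2191]
1.24711s 3.7046ms (20 10 1) (32 32 1) 29 8.1920KB 0B - - GeForce GT 640M 1 2 void matrixMulCUDA(float*, float*, float*, int, int) [2198]
1.25089s 248.13us - - - - - 819.20KB 3.3015GB/s GeForce GT 640M 1 2 [CUDA memcpy DtoH]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.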
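As it says:
The number shown in the square brackets after the kernel name correlates to the CUDA API call that launched that kernel.
If you run a given code with the --print-api-trace option, you get a sequential list of all the CUDA API calls issued by that application. If you number those calls in sequence, the number associated with a particular kernel launch is the one that appears in square brackets in the --print-gpu-trace output.
Here is a fully worked example. Note the correlation of [105], [106], and [108] between the API trace output and the GPU trace output. The first launch is the 105th call in the API trace (104 driver API calls precede it), and [107] never appears in the GPU trace because that number belongs to the cudaDeviceSynchronize call issued between the second and third launches: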
$ cat t1.cu
__global__ void k(){}
int main(){
k<<<1,1>>>();
k<<<1,1>>>();
cudaDeviceSynchronize();
k<<<1,1>>>();
cudaDeviceSynchronize();
}
$ nvcc -o t1 t1.cu
$ nvprof --print-api-trace ./t1
==7206== NVPROF is profiling process 7206, command: ./t1
==7206== Profiling application: ./t1
==7206== Profiling result:
Start Duration Name
116.17ms 3.0990us cuDeviceGetPCIBusId
130.20ms 800ns cuDeviceGetCount
130.20ms 251ns cuDeviceGetCount
130.41ms 1.0500us cuDeviceGet
130.41ms 705ns cuDeviceGetAttribute
130.42ms 539ns cuDeviceGetAttribute
130.42ms 547ns cuDeviceGetAttribute
130.46ms 525ns cuDeviceGetCount
130.46ms 277ns cuDeviceGet
130.46ms 59.680us cuDeviceGetName
130.52ms 63.802us cuDeviceTotalMem
130.59ms 497ns cuDeviceGetAttribute
130.59ms 226ns cuDeviceGetAttribute
130.59ms 282ns cuDeviceGetAttribute
130.59ms 234ns cuDeviceGetAttribute
130.59ms 229ns cuDeviceGetAttribute
130.59ms 34.628us cuDeviceGetAttribute
130.62ms 372ns cuDeviceGetAttribute
130.63ms 220ns cuDeviceGetAttribute
130.63ms 284ns cuDeviceGetAttribute
130.63ms 237ns cuDeviceGetAttribute
130.63ms 222ns cuDeviceGetAttribute
130.63ms 231ns cuDeviceGetAttribute
130.63ms 288ns cuDeviceGetAttribute
130.63ms 219ns cuDeviceGetAttribute
130.63ms 3.1870us cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 275ns cuDeviceGetAttribute
130.63ms 211ns cuDeviceGetAttribute
130.63ms 213ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 336ns cuDeviceGetAttribute
130.64ms 210ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 210ns cuDeviceGetAttribute
130.64ms 216ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 216ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 212ns cuDeviceGetAttribute
130.64ms 214ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.64ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 213ns cuDeviceGetAttribute
130.65ms 212ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 211ns cuDeviceGetAttribute
130.65ms 210ns cuDeviceGetAttribute
130.65ms 215ns cuDeviceGetAttribute
130.65ms 212ns cuDeviceGetAttribute
130.65ms 320.65us cuDeviceGetAttribute
130.97ms 322ns cuDeviceGetAttribute
130.97ms 206ns cuDeviceGetAttribute
130.97ms 218ns cuDeviceGetAttribute
130.97ms 212ns cuDeviceGetAttribute
130.97ms 212ns cuDeviceGetAttribute
130.98ms 226ns cuDeviceGetAttribute
130.98ms 220ns cuDeviceGetAttribute
130.98ms 212ns cuDeviceGetAttribute
130.98ms 210ns cuDeviceGetAttribute
130.98ms 206ns cuDeviceGetAttribute
130.98ms 207ns cuDeviceGetAttribute
130.98ms 209ns cuDeviceGetAttribute
130.98ms 211ns cuDeviceGetAttribute
130.98ms 208ns cuDeviceGetAttribute
130.98ms 208ns cuDeviceGetAttribute
130.98ms 229ns cuDeviceGetAttribute
130.98ms 215ns cuDeviceGetAttribute
130.98ms 216ns cuDeviceGetAttribute
130.98ms 209ns cuDeviceGetAttribute
130.98ms 316.59us cuDeviceGetAttribute
131.30ms 266ns cuDeviceGetAttribute
131.30ms 252ns cuDeviceGetAttribute
131.30ms 212ns cuDeviceGetAttribute
131.30ms 235ns cuDeviceGetAttribute
131.30ms 209ns cuDeviceGetAttribute
131.30ms 272ns cuDeviceGetAttribute
131.30ms 207ns cuDeviceGetAttribute
131.30ms 735ns cuDeviceGetAttribute
131.30ms 254ns cuDeviceGetAttribute
131.30ms 208ns cuDeviceGetAttribute
131.30ms 208ns cuDeviceGetAttribute
131.30ms 610ns cuDeviceGetAttribute
131.31ms 273ns cuDeviceGetAttribute
131.31ms 412ns cuDeviceGetAttribute
131.31ms 216ns cuDeviceGetAttribute
131.31ms 211ns cuDeviceGetAttribute
131.31ms 205ns cuDeviceGetAttribute
131.31ms 59.911ms cudaLaunchKernel (k(void) [105])
191.23ms 11.222us cudaLaunchKernel (k(void) [106])
191.24ms 5.7860us cudaDeviceSynchronize
191.25ms 9.2890us cudaLaunchKernel (k(void) [108])
191.26ms 5.1790us cudaDeviceSynchronize
$ nvprof --print-gpu-trace ./t1
==7224== NVPROF is profiling process 7224, command: ./t1
==7224== Profiling application: ./t1
==7224== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Device Context Stream Name
191.20ms 1.6000us (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [105]
191.22ms 896ns (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [106]
191.23ms 928ns (1 1 1) (1 1 1) 8 0B 0B Quadro K2000 (0) 1 7 k(void) [108]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
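For longer traces, the matching can be checked mechanically instead of by eye. The following is only a minimal host-side C++ sketch, assuming the text of both traces has already been captured into strings; the helper names bracket_id and bracket_ids are hypothetical and are not part of nvprof or the CUDA toolkit. It pulls the trailing bracketed number out of each trace line and ignores bracketed labels that are not numeric, such as [CUDA memcpy HtoD].

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: returns the number inside the last "[...]" on a trace
// line, or -1 if the line has no purely numeric bracketed field
// (for example "[CUDA memcpy HtoD]" or a line with no brackets at all).
long bracket_id(const std::string& line) {
    const std::string::size_type close = line.rfind(']');
    if (close == std::string::npos) return -1;
    const std::string::size_type open = line.rfind('[', close);
    if (open == std::string::npos || close - open < 2) return -1;
    long id = 0;
    for (std::string::size_type i = open + 1; i < close; ++i) {
        if (!std::isdigit(static_cast<unsigned char>(line[i]))) return -1;
        id = id * 10 + (line[i] - '0');
    }
    return id;
}

// Hypothetical helper: collects the bracketed numbers from every line of a
// captured trace, in the order in which they appear.
std::vector<long> bracket_ids(const std::string& trace) {
    std::vector<long> ids;
    std::istringstream in(trace);
    std::string line;
    while (std::getline(in, line)) {
        const long id = bracket_id(line);
        if (id >= 0) ids.push_back(id);
    }
    return ids;
}
```

Applied to the two traces above, bracket_ids should give the same sequence for both, namely 105, 106, 108. Lines such as cudaDeviceSynchronize carry no bracketed number and are skipped, which is why 107 is absent from the list.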