Counting CUDA kernel executions with nvprof
Is it possible to use nvprof to count the number of CUDA kernel executions, i.e. how many kernels were launched? Right now when I run it, what I see is:
==537== Profiling application: python tf.py
==537== Profiling result:
Time(%) Time Calls Avg Min Max Name
51.73% 91.294us 20 4.5640us 4.1280us 6.1760us [CUDA memcpy HtoD]
43.72% 77.148us 20 3.8570us 3.5840us 4.7030us [CUDA memcpy DtoH]
4.55% 8.0320us 1 8.0320us 8.0320us 8.0320us [CUDA memset]
==537== API calls:
Time(%) Time Calls Avg Min Max Name
90.17% 110.11ms 1 110.11ms 110.11ms 110.11ms cuDevicePrimaryCtxRetain
6.63% 8.0905ms 1 8.0905ms 8.0905ms 8.0905ms cuMemAlloc
0.57% 700.41us 2 350.21us 346.89us 353.52us cuMemGetInfo
0.55% 670.28us 1 670.28us 670.28us 670.28us cuMemHostAlloc
0.28% 347.01us 1 347.01us 347.01us 347.01us cuDeviceTotalMem
...
Yes, it is possible. In case you are not aware, nvprof --help provides the command-line help. What you need is the simplest usage of nvprof:

nvprof ./my_application

This outputs a list of kernels, including the name of each kernel, the number of times each was launched, and the percentage of total GPU usage each accounts for. Here is an example:
$ nvprof ./t1288
==12904== NVPROF is profiling process 12904, command: ./t1288
addr@host: 0x402add
addr@device: 0x8
run on device
func_A is correctly invoked!
run on host
func_A is correctly invoked!
==12904== Profiling application: ./t1288
==12904== Profiling result:
Time(%) Time Calls Avg Min Max Name
98.93% 195.28us 1 195.28us 195.28us 195.28us run_on_device(Parameters*)
1.07% 2.1120us 1 2.1120us 2.1120us 2.1120us assign_func_pointer(Parameters*)
==12904== Unified Memory profiling result:
Device "Tesla K20Xm (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
1 4.0000KB 4.0000KB 4.0000KB 4.000000KB 3.136000us Host To Device
6 32.000KB 4.0000KB 60.000KB 192.0000KB 34.20800us Device To Host
Total CPU Page faults: 3
==12904== API calls:
Time(%) Time Calls Avg Min Max Name
98.08% 321.35ms 1 321.35ms 321.35ms 321.35ms cudaMallocManaged
0.93% 3.0613ms 364 8.4100us 278ns 286.84us cuDeviceGetAttribute
0.42% 1.3626ms 4 340.65us 331.12us 355.60us cuDeviceTotalMem
0.38% 1.2391ms 2 619.57us 113.13us 1.1260ms cudaLaunch
0.08% 251.20us 4 62.798us 57.985us 70.827us cuDeviceGetName
0.08% 246.55us 2 123.27us 21.343us 225.20us cudaDeviceSynchronize
0.03% 98.950us 1 98.950us 98.950us 98.950us cudaFree
0.00% 8.9820us 12 748ns 278ns 2.2670us cuDeviceGet
0.00% 6.0260us 2 3.0130us 613ns 5.4130us cudaSetupArgument
0.00% 5.7190us 3 1.9060us 490ns 4.1130us cuDeviceGetCount
0.00% 5.2370us 2 2.6180us 1.2100us 4.0270us cudaConfigureCall
$
In the example above, run_on_device and assign_func_pointer are the kernel names. The documentation I linked also has sample output.

I updated the question with what I see when running nvprof. I don't see anything that is identified as a kernel.

I can think of two possibilities: 1. Your python code is not making any successful kernel calls. Are you doing proper error checking? Are you sure a kernel is being invoked? 2. You may need to tell nvprof to profile child processes; how to do that is covered in the documentation I linked. Which one applies will depend on the specific kind of work being done in the tf.py you posted, presumably tensorflow.

OK, it turned out that no kernels were being invoked.
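If you want the launch count programmatically rather than by reading the table, one option is to parse nvprof's summary text. The sketch below is a minimal, hypothetical parser, not part of nvprof itself: the regex for table rows and the heuristic of skipping bracketed names such as [CUDA memcpy HtoD] and [CUDA memset] (which are memory operations, not kernels) are assumptions based on the output shown above. It sums the Calls column for kernel rows in the GPU activity section:

```python
import re

def count_kernel_launches(nvprof_summary: str) -> int:
    """Sum the 'Calls' column for kernel rows in an nvprof text summary.

    Rows whose Name is bracketed, e.g. [CUDA memcpy HtoD] or
    [CUDA memset], are memory operations rather than kernels and are
    skipped. Only the section before the 'API calls:' header is scanned,
    so host-side API entries like cudaLaunch are not counted.
    """
    # Keep only the GPU activity table, drop the API-call table.
    gpu_part = nvprof_summary.split("API calls:")[0]
    # Match rows like: "98.93%  195.28us  1  195.28us ...  name"
    row = re.compile(r"^\s*\d+\.\d+%\s+\S+\s+(\d+)\s+\S+\s+\S+\s+\S+\s+(.*)$")
    total = 0
    for line in gpu_part.splitlines():
        m = row.match(line)
        if m and not m.group(2).strip().startswith("["):
            total += int(m.group(1))
    return total
```

Fed the second profile above, this counts 2 launches (one each of run_on_device and assign_func_pointer); fed the first profile, it counts 0, consistent with no kernels being launched. For anything beyond a quick script, nvprof's --csv option gives more robust machine-readable output than scraping the human-readable table.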