What exactly are the memory transaction metrics reported by nvprof?
I am trying to figure out what exactly each of the metrics reported by "nvprof" is. More specifically, I can't tell which transactions are System Memory and Device Memory reads and writes. I wrote a very basic code to help figure this out:
#define TYPE float
#define BDIMX 16
#define BDIMY 16
#include <cuda.h>
#include <cstdio>
#include <iostream>
__global__ void kernel(TYPE *g_output, TYPE *g_input, const int dimx, const int dimy)
{
    __shared__ float s_data[BDIMY][BDIMX];
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int in_idx = iy * dimx + ix; // index for reading input
    int tx = threadIdx.x; // thread's x-index into corresponding shared memory tile
    int ty = threadIdx.y; // thread's y-index into corresponding shared memory tile
    s_data[ty][tx] = g_input[in_idx];
    __syncthreads();
    g_output[in_idx] = s_data[ty][tx] * 1.3;
}
int main(){
    int size_x = 16, size_y = 16;
    dim3 numTB;
    numTB.x = (int)ceil((double)(size_x)/(double)BDIMX);
    numTB.y = (int)ceil((double)(size_y)/(double)BDIMY);
    dim3 tbSize;
    tbSize.x = BDIMX;
    tbSize.y = BDIMY;
    float *a, *a_out;
    float *a_d = (float *) malloc(size_x * size_y * sizeof(TYPE));
    cudaMalloc((void**)&a, size_x * size_y * sizeof(TYPE));
    cudaMalloc((void**)&a_out, size_x * size_y * sizeof(TYPE));
    for(int index = 0; index < size_x * size_y; index++){
        a_d[index] = index;
    }
    cudaMemcpy(a, a_d, size_x * size_y * sizeof(TYPE), cudaMemcpyHostToDevice);
    kernel <<<numTB, tbSize>>>(a_out, a, size_x, size_y);
    cudaDeviceSynchronize();
    return 0;
}
I understand the shared and global accesses. The global accesses are coalesced, and since there are 8 warps, there are 8 transactions.
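But I can't work out the System Memory and Device Memory write transaction counts.
It is best if you have a model of the GPU memory hierarchy that covers both the logical and the physical spaces, such as the one … Referring to the "Overview tab" diagram: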
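A further reference may also be of interest.
Thanks for your answer. The figure helped a bit, but I still have some questions. I don't understand why this code has 4 system memory transactions, and where do the rest of the DRAM write transactions come from? This is a very straightforward code, so I don't expect any unknown transactions!
Trying to interpret the counters for a single warp is dangerous. The GPU has multiple engines and execution units running concurrently with the shaders. The further a counter is from the SM, the weaker the ability to filter it, so you can see increments from these other units. I suggest launching as many 1-warp blocks as there are SMs, then twice that many 1-warp blocks, and checking whether the counters scale accordingly.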
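The scaling check suggested in the last comment can be turned into a small calculation. The sketch below is host-side C++ only and rests on an assumption that is not stated in the thread: that a counter behaves roughly like `fixed + perBlock * blocks`. The names `CounterFit` and `fitCounter` are hypothetical helpers, not part of nvprof or the CUDA toolkit.

```cpp
#include <stdexcept>

// Hypothetical helper type: splits a counter reading into a part that
// scales with the number of blocks and a part that does not.
struct CounterFit {
    double perBlock;  // increments attributable to each launched block
    double fixed;     // increments that do not scale with the launch
};

// countN  : counter value measured with n blocks of 1 warp
// count2N : counter value measured with 2*n blocks of 1 warp
// Assumes (as a simplification) counter = fixed + perBlock * blocks.
inline CounterFit fitCounter(double countN, double count2N, int n) {
    if (n <= 0) {
        throw std::invalid_argument("n must be positive");
    }
    const double perBlock = (count2N - countN) / n;
    const double fixed = countN - perBlock * n;  // equals 2*countN - count2N
    return {perBlock, fixed};
}
```

Here `n` would be the SM count (for example `cudaDeviceProp::multiProcessorCount`), and the two counter values would come from two separate nvprof runs. A `fixed` part well above zero would suggest that some of the increments come from units other than the kernel itself, which is the effect the comment warns about.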
Metric Name                Metric Description                  Min  Max  Avg
Device "Tesla K40c (0)"
    Kernel: kernel(float*, float*, int, int)
local_load_transactions    Local Load Transactions               0    0    0
local_store_transactions   Local Store Transactions              0    0    0
shared_load_transactions   Shared Load Transactions              8    8    8
shared_store_transactions  Shared Store Transactions             8    8    8
gld_transactions           Global Load Transactions              8    8    8
gst_transactions           Global Store Transactions             8    8    8
sysmem_read_transactions   System Memory Read Transactions       0    0    0
sysmem_write_transactions  System Memory Write Transactions      4    4    4
tex_cache_transactions     Texture Cache Transactions            0    0    0
dram_read_transactions     Device Memory Read Transactions       0    0    0
dram_write_transactions    Device Memory Write Transactions     40   40   40
l2_read_transactions       L2 Read Transactions                 70   70   70
l2_write_transactions      L2 Write Transactions                46   46   46
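The count of 8 for the global load and store transactions matches the coalescing argument in the question. With a 16x16 block and `dimx == 16`, each warp covers 32 consecutive floats, i.e. 128 contiguous bytes. The arithmetic can be sketched in host-side C++ as follows. The helper names are hypothetical, and the 128-byte transaction size is an assumption inferred from the reported counts, not something nvprof states.

```cpp
#include <cstddef>

// Number of warps needed to cover `threads` threads, rounding up.
constexpr std::size_t warpCount(std::size_t threads, std::size_t warpSize = 32) {
    return (threads + warpSize - 1) / warpSize;
}

// Transactions for one fully coalesced warp access, assuming each
// transaction moves `segmentBytes` contiguous bytes (assumed 128 here).
constexpr std::size_t transactionsPerWarp(std::size_t elemBytes,
                                          std::size_t segmentBytes = 128,
                                          std::size_t warpSize = 32) {
    return (warpSize * elemBytes + segmentBytes - 1) / segmentBytes;
}

// Expected transactions for one coalesced access per thread.
constexpr std::size_t expectedTransactions(std::size_t threads, std::size_t elemBytes) {
    return warpCount(threads) * transactionsPerWarp(elemBytes);
}

// 16 x 16 threads of float: 8 warps, 1 transaction each.
static_assert(expectedTransactions(16 * 16, sizeof(float)) == 8,
              "one transaction per warp for the 16x16 float tile");
```

The same reasoning does not extend to the `sysmem_*`, `dram_*` and `l2_*` rows, which sit further from the SM and are the counters the question is about.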