CUDA kernel is not invoked at runtime
I am trying to compute a simple matrix addition with a CUDA kernel. The code is as follows -
int main(){
    unsigned int rows = 16384;
    unsigned int columns = 4096;
    int size = rows * columns * sizeof(float);
    float *matA, *matB, *out;
    float *d_matA, *d_matB, *d_out;
    matA = (float*) malloc(size);
    matB = (float*) malloc(size);
    out  = (float*) malloc(size);
    #pragma omp parallel for simd
    for (int i = 0; i < rows * columns; i++) {
        *(matA + i) = 1.0f;
        *(matB + i) = 2.0f;
        *(out + i)  = 45.0f;
    }
    cudaMalloc(&d_matA, size);
    cudaMalloc(&d_matB, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_matA, matA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_matB, matB, size, cudaMemcpyHostToDevice);
    dim3 threads(1024, 1024);
    dim3 blocks(int(rows/threads.x), int(columns/threads.y));
    //fillMatrix<<<blocks, threads>>>(d_matA, d_matB, d_out, rows, columns);
    matrixAdd2D<<<blocks, threads>>>(d_matA, d_matB, d_out, rows, columns);
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 25; i++)
        std::cout << *(out + i) << " ";
    std::cout << std::endl;
}
==86597== NVPROF is profiling process 86597, command: ./a.out
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
==86597== Profiling application: ./a.out
==86597== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 67.70% 84.207ms 2 42.104ms 41.868ms 42.339ms [CUDA memcpy HtoD]
32.30% 40.181ms 1 40.181ms 40.181ms 40.181ms [CUDA memcpy DtoH]
API calls: 50.38% 126.86ms 3 42.285ms 221.17us 126.40ms cudaMalloc
49.50% 124.65ms 3 41.549ms 40.492ms 42.215ms cudaMemcpy
0.06% 152.63us 1 152.63us 152.63us 152.63us cuDeviceTotalMem
0.05% 115.17us 101 1.1400us 86ns 49.268us cuDeviceGetAttribute
0.02% 40.352us 1 40.352us 40.352us 40.352us cuDeviceGetName
0.00% 7.5130us 1 7.5130us 7.5130us 7.5130us cuDeviceGetPCIBusId
0.00% 887ns 3 295ns 141ns 598ns cuDeviceGetCount
0.00% 645ns 2 322ns 104ns 541ns cuDeviceGet
0.00% 517ns 1 517ns 517ns 517ns cudaLaunchKernel
0.00% 199ns 1 199ns 199ns 199ns cuDeviceGetUuid
As you can see, my kernel matrixAdd2D is never even invoked. Where am I going wrong?
You have an illegal block size.

Thanks for replying @talonmies, could you elaborate on that? — There is nothing to elaborate on. Your kernel never runs because you have an illegal block size. — Can you tell me what a legal size would be, @talonmies? I can't even run with 32 threads in each dimension. The code I posted was just me trying something extreme, because nothing was working even when I did it properly. I know that a block can have at most 1024 threads. I will edit the code I posted. — If you knew about the limit of 1024 threads per block, why did you think that a 1024x1024 or a 64x64 block would work?
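The fix the answer points at can be sketched as follows. This is a sketch, not the original poster's code: the body of matrixAdd2D is an assumption (the post never shows its definition), and the key changes are a 32x32 block (1024 threads, the per-block limit on current GPUs), a grid size rounded up with ceiling division, and an error check right after the launch, which would have reported "invalid configuration argument" for the original 1024x1024 block:

```cuda
#include <cstdio>
#include <cstdlib>

// Assumed kernel body: one thread per matrix element, guarded against
// out-of-range indices when the grid overshoots the matrix.
__global__ void matrixAdd2D(const float* a, const float* b, float* out,
                            unsigned int rows, unsigned int columns)
{
    unsigned int r = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int c = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < columns)
        out[r * columns + c] = a[r * columns + c] + b[r * columns + c];
}

int main()
{
    unsigned int rows = 16384, columns = 4096;
    size_t size = (size_t)rows * columns * sizeof(float);

    float *matA = (float*)malloc(size);
    float *matB = (float*)malloc(size);
    float *out  = (float*)malloc(size);
    for (size_t i = 0; i < (size_t)rows * columns; i++) {
        matA[i] = 1.0f; matB[i] = 2.0f; out[i] = 45.0f;
    }

    float *d_matA, *d_matB, *d_out;
    cudaMalloc(&d_matA, size);
    cudaMalloc(&d_matB, size);
    cudaMalloc(&d_out, size);
    cudaMemcpy(d_matA, matA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_matB, matB, size, cudaMemcpyHostToDevice);

    // 32 x 32 = 1024 threads: at the per-block limit, but legal.
    dim3 threads(32, 32);
    // Ceiling division so partial blocks at the edges are still covered.
    dim3 blocks((rows + threads.x - 1) / threads.x,
                (columns + threads.y - 1) / threads.y);
    matrixAdd2D<<<blocks, threads>>>(d_matA, d_matB, d_out, rows, columns);

    // With the original dim3 threads(1024, 1024) this reports
    // "invalid configuration argument" instead of failing silently.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < 25; i++)
        printf("%g ", out[i]);
    printf("\n");
    return 0;
}
```

Checking cudaGetLastError() after every launch is the standard way to catch this class of bug; the original code never inspects any CUDA return value, which is why the failure was invisible outside the profiler.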