CUDA kernel not invoked at runtime

I am trying to compute a simple matrix addition with a CUDA kernel. The code is as follows:

#include <cstdlib>
#include <iostream>

int main(){
    unsigned int rows = 16384;
    unsigned int columns = 4096;

    int size = rows * columns * sizeof(float );

    float* matA, *matB, *out;
    float* d_matA, *d_matB, *d_out;

    matA  = (float*) malloc(size);
    matB = (float*) malloc(size);
    out = (float *) malloc(size);

#pragma omp parallel for simd
    for (int i = 0; i < rows * columns; i++) {
        *(matA + i) = 1.0f;
        *(matB + i) = 2.0f;
        *(out + i) = 45.0f;
    }

    cudaMalloc(&d_matA, size);
    cudaMalloc(&d_matB, size);
    cudaMalloc(&d_out, size);

    cudaMemcpy(d_matA, matA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_matB, matB, size, cudaMemcpyHostToDevice);

    dim3 threads(1024, 1024);
    dim3 blocks(int(rows/threads.x), int(columns/threads.y));

    //fillMatrix<<<blocks, threads>>>(d_matA ,d_matB, d_out, rows, columns);
    matrixAdd2D<<<blocks, threads>>>(d_matA, d_matB, d_out, rows, columns);

    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    for (int i = 0; i< 25; i++)
        std::cout<<*(out + i)<<"  ";

    std::cout<<std::endl;

}
==86597== NVPROF is profiling process 86597, command: ./a.out
0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  
==86597== Profiling application: ./a.out
==86597== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   67.70%  84.207ms         2  42.104ms  41.868ms  42.339ms  [CUDA memcpy HtoD]
                   32.30%  40.181ms         1  40.181ms  40.181ms  40.181ms  [CUDA memcpy DtoH]
      API calls:   50.38%  126.86ms         3  42.285ms  221.17us  126.40ms  cudaMalloc
                   49.50%  124.65ms         3  41.549ms  40.492ms  42.215ms  cudaMemcpy
                    0.06%  152.63us         1  152.63us  152.63us  152.63us  cuDeviceTotalMem
                    0.05%  115.17us       101  1.1400us      86ns  49.268us  cuDeviceGetAttribute
                    0.02%  40.352us         1  40.352us  40.352us  40.352us  cuDeviceGetName
                    0.00%  7.5130us         1  7.5130us  7.5130us  7.5130us  cuDeviceGetPCIBusId
                    0.00%     887ns         3     295ns     141ns     598ns  cuDeviceGetCount
                    0.00%     645ns         2     322ns     104ns     541ns  cuDeviceGet
                    0.00%     517ns         1     517ns     517ns     517ns  cudaLaunchKernel
                    0.00%     199ns         1     199ns     199ns     199ns  cuDeviceGetUuid


As you can see, my kernel matrixAdd2D is not even being called. What am I doing wrong?

You have an illegal block size…

Thanks for the reply @talonmies, could you elaborate?

There is nothing to elaborate on. Your kernel never ran because you have an illegal block size.

Can you tell me what a legal size would be, @talonmies? I can't even run it with 32 threads in each dimension. The code I posted was just me trying something extreme, because nothing was working; it didn't work even when I did that. I know a block can have at most 1024 threads. I will edit the code I posted.

If you know about the limit of 1024 threads per block, why did you expect a 1024x1024 or 64x64 block to work?
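For reference, a minimal corrected sketch follows. The original matrixAdd2D was not posted, so the kernel body here is an assumption; the essential fixes are the ones the comments point at: a 2D block may hold at most 1024 threads in total (so 32x32 is the largest square block), the grid should be sized with ceiling division so the whole matrix is covered, and checking cudaGetLastError() after the launch surfaces the invalid-configuration error instead of failing silently.

```cuda
#include <cstdio>

// Hypothetical element-wise addition kernel (the original body was not shown).
__global__ void matrixAdd2D(const float* a, const float* b, float* out,
                            unsigned int rows, unsigned int columns) {
    unsigned int r = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int c = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < rows && c < columns)
        out[r * columns + c] = a[r * columns + c] + b[r * columns + c];
}

void launchAdd(float* d_matA, float* d_matB, float* d_out,
               unsigned int rows, unsigned int columns) {
    // 32*32 = 1024 threads per block, the per-block hardware maximum.
    dim3 threads(32, 32);
    // Ceiling division so partial tiles at the edges are still covered.
    dim3 blocks((rows + threads.x - 1) / threads.x,
                (columns + threads.y - 1) / threads.y);
    matrixAdd2D<<<blocks, threads>>>(d_matA, d_matB, d_out, rows, columns);

    // With dim3 threads(1024, 1024) this would report
    // "invalid configuration argument" instead of failing silently.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("kernel launch failed: %s\n", cudaGetErrorString(err));
}
```

The near-zero cudaLaunchKernel time (517ns) in the nvprof output is consistent with this: the launch returned an error immediately, and since the code never checked it, the only visible symptom was the unmodified output buffer.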