CUDA threads and blocks. I've also posted this on the NVIDIA forums; I figured it could use a few more eyes.
I'm having trouble extending my code to handle multiple cases. I've been developing against the most common case, and now it's time for testing: I need to make sure it works for the other cases as well. Currently my kernel is executed inside a loop (there are reasons we don't do the whole task in a single kernel call) to calculate values across the rows of a matrix. The most common case is 512 columns by 512 rows. I need to handle matrices of size 512×512, 1024×512, and 512×1024, as well as other combinations, but the largest will be 1024×1024. I've been using a fairly simple kernel call:
launchKernel<<<1,512>>>(................)
This kernel works for the common 512×512 case and for 512×1024 (columns and rows, respectively), but not for the 1024×512 case, which requires 1024 threads to execute. In my naivety, I've been trying different variations of the simple kernel call to launch 1024 threads:
launchKernel<<<2,512>>>(................) // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???
I believe my problem comes from my lack of understanding of threads and blocks.
Here is the output of deviceQuery; as you can see, I can have at most 1024 threads per block:
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Tesla C2050"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818572288 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 40 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro 600"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 2) Multiprocessors x (48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock Speed: 1.28 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 15 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600
I am only using the Tesla C2050 device.
Here is a stripped-down version of my kernel, so you have an idea of what it does:
#define twoPi 6.283185307179586
#define speed_of_light 3.0E8
#define MaxSize 999
__global__ void calcRx4CPP4
(
const float *array1,
const double *array2,
const float scalar1,
const float scalar2,
const float scalar3,
const float scalar4,
const float scalar5,
const float scalar6,
const int scalar7,
const int scalar8,
float *outputArray1,
float *outputArray2)
{
float scalar9;
int idx;
double scalar10;
double scalar11;
float sumReal, sumImag;
float real, imag;
float coeff1, coeff2, coeff3, coeff4;
sumReal = 0.0;
sumImag = 0.0;
// kk loop 1 .. 512 (scalar7)
idx = (blockIdx.x * blockDim.x) + threadIdx.x;
/* Declare the shared memory parameters */
__shared__ float SharedArray1[MaxSize];
__shared__ double SharedArray2[MaxSize];
/* populate the arrays on shared memory */
SharedArray1[idx] = array1[idx]; // first 512 elements
SharedArray2[idx] = array2[idx];
if (idx+blockDim.x < MaxSize){
SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x];
SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x];
}
__syncthreads();
// input scalars used here.
scalar10 = ...;
scalar11 = ...;
for (int kk = 0; kk < scalar8; kk++)
{
/* some calculations */
// SharedArray1, SharedArray2 and scalar9 used here
sumReal = ...;
sumImag = ...;
}
/* calculation of the exponential of a complex number */
real = ...;
imag = ...;
coeff1 = (sumReal * real);
coeff2 = (sumReal * imag);
coeff3 = (sumImag * real);
coeff4 = (sumImag * imag);
outputArray1[idx] = (coeff1 - coeff4);
outputArray2[idx] = (coeff2 + coeff3);
}