CUDA threads and blocks. I've also posted this on the NVIDIA forums; I figured it could use a few more eyes.
I'm having trouble extending my code to handle multiple cases. I've been developing against the most common case, and now it's time for testing: I need to make sure it works for the other cases as well. Currently my kernel is executed inside a loop (there are reasons we don't do the whole task in a single kernel call) to calculate values across the rows of a matrix. The most common case is 512 columns by 512 rows. I need to handle matrices of size 512×512, 1024×512, and 512×1024, as well as other combinations, but the largest will be 1024×1024. I've been using a fairly simple kernel call:
launchKernel<<<1,512>>>(................)
This kernel works for the common 512×512 case and for 512×1024 (columns and rows, respectively), but not for the 1024×512 case, which requires 1024 threads to execute. In my naivety, I've been trying different variations of the simple kernel call to launch 1024 threads:
launchKernel<<<2,512>>>(................) // 2 blocks with 512 threads each ???
launchKernel<<<1,1024>>>(................) // 1 block with 1024 threads ???
I believe my problem comes from my lack of understanding of threads and blocks.
Here is the output of deviceQuery; as you can see, I can have at most 1024 threads per block:
C:\ProgramData\NVIDIA Corporation\NVIDIA GPU Computing SDK 4.1\C\bin\win64\Release\deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 2 CUDA Capable device(s)
Device 0: "Tesla C2050"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2688 MBytes (2818572288 bytes)
(14) Multiprocessors x (32) CUDA Cores/MP: 448 CUDA Cores
GPU Clock Speed: 1.15 GHz
Memory Clock rate: 1500.00 Mhz
Memory Bus Width: 384-bit
L2 Cache Size: 786432 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 2 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 40 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
Device 1: "Quadro 600"
CUDA Driver Version / Runtime Version 4.2 / 4.1
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 1024 MBytes (1073741824 bytes)
( 2) Multiprocessors x (48) CUDA Cores/MP: 96 CUDA Cores
GPU Clock Speed: 1.28 GHz
Memory Clock rate: 800.00 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 131072 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): No
Device PCI Bus ID / PCI location ID: 15 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.2, CUDA Runtime Version = 4.1, NumDevs = 2, Device = Tesla C2050, Device = Quadro 600
I am only using the Tesla C2050 device.
Here is a stripped-down version of my kernel, so you have an idea of what it does:
#define twoPi 6.283185307179586
#define speed_of_light 3.0E8
#define MaxSize 999
__global__ void calcRx4CPP4
(
const float *array1,
const double *array2,
const float scalar1,
const float scalar2,
const float scalar3,
const float scalar4,
const float scalar5,
const float scalar6,
const int scalar7,
const int scalar8,
float *outputArray1,
float *outputArray2)
{
float scalar9;
int idx;
double scalar10;
double scalar11;
float sumReal, sumImag;
float real, imag;
float coeff1, coeff2, coeff3, coeff4;
sumReal = 0.0;
sumImag = 0.0;
// kk loop 1 .. 512 (scalar7)
idx = (blockIdx.x * blockDim.x) + threadIdx.x;
/* Declare the shared memory parameters */
__shared__ float SharedArray1[MaxSize];
__shared__ double SharedArray2[MaxSize];
/* populate the arrays on shared memory */
SharedArray1[idx] = array1[idx]; // first 512 elements
SharedArray2[idx] = array2[idx];
if (idx+blockDim.x < MaxSize){
SharedArray1[idx+blockDim.x] = array1[idx+blockDim.x];
SharedArray2[idx+blockDim.x] = array2[idx+blockDim.x];
}
__syncthreads();
// input scalars used here.
scalar10 = ...;
scalar11 = ...;
for (int kk = 0; kk < scalar8; kk++)
{
/* some calculations */
// SharedArray1, SharedArray2 and scalar9 used here
sumReal = ...;
sumImag = ...;
}
/* calculation of the exponential of a complex number */
real = ...;
imag = ...;
coeff1 = (sumReal * real);
coeff2 = (sumReal * imag);
coeff3 = (sumImag * real);
coeff4 = (sumImag * imag);
outputArray1[idx] = (coeff1 - coeff4);
outputArray2[idx] = (coeff2 + coeff3);
}