Parallel processing: Dividing work among threads in CUDA using Thrust


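I have some test code that needs to update the keys of the objects stored in a device vector of a class. How can I divide the work so that specific threads each handle a part of it?

Code example without the division: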

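__global__ void UpdateKeys(Request* vector, int size, int seed, int qt_threads){
   curandState_t state;
   curand_init(seed, threadIdx.x, 0, &state);
   // Global thread index across all blocks
   int id = blockIdx.x * blockDim.x + threadIdx.x;
   if(id < size){
       // 100.0f avoids integer division, which would always yield 0
       // (this assumes key_ is a floating-point member)
       vector[id].key_ = (curand(&state) % 100) / 100.0f;
   }
}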

The vector is passed in as a thrust::device_vector.

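Examples of what I want:

1000 keys and 2000 threads: use only 1000 threads and give each one a single key.
1000 keys and 1000 threads: use all of them.
1 key and 100 threads: use 1 thread.
500 keys and 250 threads: each thread handles 2 keys.
240 keys and 80 threads: each thread handles 3 keys.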

If you modify the basic kernel structure like this:

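__global__ void UpdateKeys(Request* vector, int size, int seed, int qt_threads){
   curandState_t state;
   curand_init(seed, threadIdx.x, 0, &state);
   // Global thread index across all blocks
   int id = blockIdx.x * blockDim.x + threadIdx.x;
   // Total number of threads in the grid, used as the loop stride
   int gid = blockDim.x * gridDim.x;
   for(; id < size; id += gid){
       // 100.0f avoids integer division, which would always yield 0
       // (this assumes key_ is a floating-point member)
       vector[id].key_ = (curand(&state) % 100) / 100.0f;
   }
}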

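then any legal one-dimensional block size (and any number of one-dimensional blocks) should be able to process as much or as little input as you choose to supply through the size parameter. If you run more threads than keys, some threads will do nothing. If you run fewer threads than keys, some threads will process more than one key.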
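
The behaviour described above can be checked with a small host-side harness. What follows is only a sketch under stated assumptions, not the asker's actual code: the Request struct is a hypothetical stand-in (the real class is not shown), key_ is assumed to be a float, the owner_ member and the keys_per_thread helper are invented purely for illustration, and the unused qt_threads parameter is dropped. The sketch also seeds cuRAND with the global thread id rather than threadIdx.x, so that threads in different blocks do not share a subsequence.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>
#include <curand_kernel.h>
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>

// Hypothetical stand-in for the question's Request class.
struct Request {
    float key_;    // assumed float; the original type is not shown
    int   owner_;  // illustration only: global id of the thread that wrote key_
};

__global__ void UpdateKeys(Request* vector, int size, int seed) {
    int first  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread id
    int stride = blockDim.x * gridDim.x;                 // threads in the grid
    if (first >= size) return;  // surplus threads exit before the costly init

    curandState_t state;
    curand_init(seed, first, 0, &state);  // one subsequence per global thread

    // Grid-stride loop: thread t handles keys t, t + stride, t + 2*stride, ...
    for (int id = first; id < size; id += stride) {
        vector[id].key_   = (curand(&state) % 100) / 100.0f;
        vector[id].owner_ = first;
    }
}

// Launches the kernel once and returns how many keys each thread processed.
std::vector<int> keys_per_thread(int size, int blocks, int threads_per_block) {
    thrust::device_vector<Request> d_vec(size);
    // A kernel takes a raw pointer, so unwrap the device_vector's storage.
    Request* raw = thrust::raw_pointer_cast(d_vec.data());
    UpdateKeys<<<blocks, threads_per_block>>>(raw, size, 1234);
    cudaError_t err = cudaDeviceSynchronize();
    assert(err == cudaSuccess);
    (void)err;

    thrust::host_vector<Request> h_vec = d_vec;  // copy results back to host
    std::vector<int> counts(blocks * threads_per_block, 0);
    for (int i = 0; i < size; ++i) ++counts[h_vec[i].owner_];
    return counts;
}
```

Calling keys_per_thread(240, 1, 80) launches a single block of 80 threads over 240 keys, and keys_per_thread(1000, 2, 1000) launches 2000 threads over 1000 keys. In both cases there is exactly one kernel launch and one grid; only the number of loop iterations per thread changes.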
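"For example, 500 keys to update using 1000 threads, so each thread needs to handle 2"?
Oops. My bad. It should be 500 keys using 500 threads. How do I divide...?
Are you asking because of a performance concern?
Yes. I need maximum optimization, but I can't get there without knowing how to assign work to specific threads (as in the examples).
This is a good solution, but you are assuming I will use multiple grids, right? And I think that would cost me time.
@AlvaroEspindola: Uh, no.
No to losing time, or no to the grids?
To the grids. You only need one kernel launch.
Yes, but I can't use the grid as it is now, because I plan to use it later on for something else.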