CUDA: struggling with intuition on how warp-synchronous thread execution works

cuda parallel-processing gpu reduction

I am new to CUDA. I am studying basic parallel algorithms, such as reduction, to understand how thread execution works. I have the following code:

__global__ void
Reduction2_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    for ( int activeThreads = blockDim.x>>1;
              activeThreads > 32;
              activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }
    if ( threadIdx.x < 32 ) {
        volatile int *wsSum = sPartials;
        if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this statement? any example, please?
        wsSum[tid] += wsSum[tid + 16];  // how are these statements executed in parallel within a warp?
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];
        if ( tid == 0 ) {
            volatile int *wsSum = sPartials; // why is this statement needed?
            out[blockIdx.x] = wsSum[0];
        }
    }
}

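Unfortunately, it is not clear to me how the code works from the if ( threadIdx.x < 32 ) condition onward. Can somebody give an intuitive example with thread IDs, showing how these statements are executed? I think it is important to understand these concepts, so any help would be appreciated.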

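After the first two code blocks (separated by __syncthreads()), every thread block is left with 64 values, stored in that block's sPartials[] (assuming at least 64 threads per block). The code inside

if ( threadIdx.x < 32 )

therefore accumulates those 64 values. It exists only to speed up the reduction: the remaining accumulation steps involve so little data that it is not worth continuing to halve the threads in a loop. You can change the condition in the second code block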

for ( int activeThreads = blockDim.x>>1;
              activeThreads > 32;
              activeThreads >>= 1 )

so that the loop continues down to a single active thread, instead of using

if ( threadIdx.x < 32 ) {
        volatile int *wsSum = sPartials;
        if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; 
        wsSum[tid] += wsSum[tid + 16]; 
        wsSum[tid] += wsSum[tid + 8];
        wsSum[tid] += wsSum[tid + 4];
        wsSum[tid] += wsSum[tid + 2];
        wsSum[tid] += wsSum[tid + 1];

which is easier to understand.

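After the accumulation, each block's sPartials[] has been reduced to a single value, stored in sPartials[0] (wsSum[0] in the code), which thread 0 writes to out[blockIdx.x].

After the kernel finishes, you can add up the per-block values in out[] on the CPU to get the final result.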
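
As a rough illustration of the two suggestions above, the sketch below runs the tree loop all the way down to one active thread and then adds the per-block results on the CPU. The kernel name ReductionSimple_kernel and the helper reduceOnHost are made up for this example, the block size is assumed to be a power of two, and error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Simplified variant of the question's kernel (hypothetical name): the tree
// loop runs down to one active thread with a barrier after every step, so no
// warp-synchronous tail is needed. Assumes blockDim.x is a power of two.
__global__ void ReductionSimple_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    for ( int activeThreads = blockDim.x>>1;
              activeThreads > 0;
              activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }
    if ( tid == 0 ) {
        out[blockIdx.x] = sPartials[0];   // one partial sum per block
    }
}

// Hypothetical host helper: launches the kernel, then adds the per-block
// results on the CPU. Error checking is omitted for brevity.
int reduceOnHost( const std::vector<int> &data, int numBlocks, int numThreads )
{
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc( &dIn, data.size()*sizeof(int) );
    cudaMalloc( &dOut, numBlocks*sizeof(int) );
    cudaMemcpy( dIn, data.data(), data.size()*sizeof(int),
                cudaMemcpyHostToDevice );

    ReductionSimple_kernel<<<numBlocks, numThreads, numThreads*sizeof(int)>>>(
        dOut, dIn, data.size() );

    std::vector<int> partials( numBlocks );
    cudaMemcpy( partials.data(), dOut, numBlocks*sizeof(int),
                cudaMemcpyDeviceToHost );   // also waits for the kernel
    cudaFree( dIn );
    cudaFree( dOut );

    int total = 0;
    for ( int p : partials ) total += p;    // final accumulation on the CPU
    return total;
}
```

A call such as reduceOnHost( data, 4, 128 ) would launch 4 blocks of 128 threads and return the sum of the 4 values the kernel wrote to out[].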

Let's look at the code in blocks and answer your questions along the way:

int sum = 0;
const int tid = threadIdx.x;
for ( size_t i = blockIdx.x*blockDim.x + tid;
      i < N;
      i += blockDim.x*gridDim.x ) {
    sum += in[i];
}
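
To make the access pattern of this loop concrete, here is a small host-side sketch that lists the indices a given thread would add up. The function name indicesForThread is hypothetical, and it only mirrors the loop's arithmetic on the CPU; it does not run on the GPU.

```cuda
#include <cstddef>
#include <vector>

// Host-side model of the grid-stride loop: returns the indices of in[] that
// the thread with the given blockIdx.x and threadIdx.x would accumulate.
std::vector<size_t> indicesForThread( unsigned blockIdxX, unsigned tid,
                                      unsigned blockDimX, unsigned gridDimX,
                                      size_t N )
{
    std::vector<size_t> visited;
    for ( size_t i = blockIdxX*blockDimX + tid;
          i < N;
          i += blockDimX*gridDimX ) {
        visited.push_back( i );
    }
    return visited;
}
```

With 2 blocks of 2 threads and N = 10, thread 1 of block 0 visits indices 1, 5 and 9: each thread strides through the input by the total number of threads in the grid.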

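As each thread finishes (that is, when its for-loop index reaches or exceeds N), it stores its intermediate sum in shared memory (sPartials[tid] = sum;) and then waits at __syncthreads() for all other threads in the block to finish.
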
for ( int activeThreads = blockDim.x>>1;
          activeThreads > 32;
          activeThreads >>= 1 ) {
    if ( tid < activeThreads ) {
        sPartials[tid] += sPartials[tid+activeThreads];
    }
    __syncthreads();
}

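For the remainder of the kernel after this point (the if ( threadIdx.x < 32 ) block), we will only be using the first 32 threads, i.e. the first warp. All other threads remain idle. Note that there is no __syncthreads() after this point either, since that would violate the rule for using it: all threads in the block must participate in a __syncthreads().

We are now creating a volatile pointer to shared memory (volatile int *wsSum = sPartials;). In theory, this tells the compiler not to apply certain optimizations, such as keeping a particular value in a register instead of reading and writing shared memory. Why didn't we need it before? Because __syncthreads() carries that effect with it: besides making all threads wait for each other at the barrier, a __syncthreads() call also forces each thread's updates back out to shared or global memory. We can no longer rely on that, because from here on we will not use __syncthreads(), having restricted ourselves to a single warp for the rest of the kernel.
    if ( blockDim.x > 32 ) wsSum[tid] += wsSum[tid + 32]; // why do we need this statement?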

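The previous reduction block left us with 64 partial sums, but we have now limited ourselves to 32 threads. So we need one more combining step to fold the 64 partial sums into 32 before continuing with the rest of the reduction. For example, thread 0 adds wsSum[32] into wsSum[0], and thread 31 adds wsSum[63] into wsSum[31]. The blockDim.x > 32 test skips this step when the block has only 32 threads, in which case there are only 32 partial sums to begin with.
    wsSum[tid] += wsSum[tid + 16];  // how are these statements executed in parallel within a warp?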

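Now we are finally getting into some warp-synchronous programming. This line of code depends on the fact that the 32 threads execute in lockstep. To understand why (and how it works), it helps to break the line down into the sequence of operations needed to complete it. It looks something like this: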
    read the partial sum of my thread into a register
    read the partial sum of the thread that is 16 higher than my thread, into a register
    add the two partial sums
    store the result back into the partial sum corresponding to my thread

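All 32 threads follow the above sequence in lockstep. All 32 threads first read wsSum[tid] into a (thread-local) register: thread 0 reads wsSum[0], thread 1 reads wsSum[1], and so on. After that, each thread reads another partial sum into a different register: thread 0 reads wsSum[16], thread 1 reads wsSum[17], and so on. It is true that we don't care about the wsSum[32] (and higher) values; we have already folded those into the first 32 wsSum[] values. However, as we will see, only the first 16 threads (at this step) contribute to the final result, so the first 16 threads combine the 32 partial sums into 16. The next 16 threads act as well, but they are only doing garbage work, which will be ignored. Because every thread finishes its reads before any thread writes, wsSum[16..31] are read by threads 0..15 before threads 16..31 overwrite them, so the valid data is not corrupted.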
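
The read-then-write sequence described above can be modeled on the CPU to see where the useful sums and the garbage values end up. This is only a sketch of the lockstep assumption, not GPU code; the names lockstepStep and lockstepTail are made up for the illustration, and the array is assumed to hold the 64 partial sums left by the previous loop.

```cuda
#include <array>

// CPU model of one warp-synchronous step: all 32 lanes finish their reads
// before any lane writes, which is the lockstep assumption described above.
void lockstepStep( std::array<int, 64> &wsSum, int offset )
{
    int mine[32], other[32];
    for ( int lane = 0; lane < 32; ++lane ) {
        mine[lane]  = wsSum[lane];            // read my partial sum
        other[lane] = wsSum[lane + offset];   // read the one `offset` higher
    }
    for ( int lane = 0; lane < 32; ++lane ) {
        wsSum[lane] = mine[lane] + other[lane];   // add and store back
    }
}

// Applies the same steps as the kernel's tail (+32 down to +1) to a copy of
// the array and returns wsSum[0], the sum of all 64 inputs.
int lockstepTail( std::array<int, 64> wsSum )
{
    for ( int offset = 32; offset > 0; offset >>= 1 ) {
        lockstepStep( wsSum, offset );
    }
    return wsSum[0];
}
```

Starting from 64 ones, the +32 step leaves 2 in wsSum[0..31]. After the +16 step, wsSum[0..15] hold 4 while wsSum[16..31] hold 3: those are the garbage values, and no later step that contributes to wsSum[0] reads them.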
The above step combined 32 partial sums into the first 16 locations in wsSum[]. The next line of code:
    wsSum[tid] += wsSum[tid + 8];

repeats this process at a granularity of 8. Again, all 32 threads are active, and the micro-sequence looks like this:
    read the partial sum of my thread into a register
    read the partial sum of the thread that is 8 higher than my thread, into a register
    add the two partial sums
    store the result back into the partial sum corresponding to my thread

So the first 8 threads combine the 16 partial sums of interest (wsSum[0..15]) into 8 partial sums (wsSum[0..7]). The other threads are again doing extra work that is ignored. The remaining lines follow the same pattern:

    wsSum[tid] += wsSum[tid + 4];  //this combines partial sums of interest into 4 locations
    wsSum[tid] += wsSum[tid + 2];  //this combines partial sums of interest into 2 locations
    wsSum[tid] += wsSum[tid + 1];  //this combines partial sums of interest into 1 location

    if ( tid == 0 ) {
        volatile int *wsSum = sPartials; // why is this statement needed?
        out[blockIdx.x] = wsSum[0];
    }

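The previous steps reduced all of the block's partial sums to a single value in wsSum[0], and thread 0 now writes it to global memory. The second volatile int *wsSum = sPartials; declaration inside this block only redeclares a pointer that is already in scope with the same value, so it appears to be unnecessary.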
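
The code discussed here relies on implicit lockstep execution within a warp. On GPUs with independent thread scheduling (Volta and later), NVIDIA's documentation advises against relying on that and recommends explicit warp-level synchronization instead. A minimal sketch of the same tail using __syncwarp() (available since CUDA 9) might look like the following. The kernel name ReductionSyncwarp_kernel and the helper sumOneBlock are hypothetical, the block size is assumed to be a power of two of at least 64, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Same structure as the question's kernel, but the single-warp tail separates
// reads from writes with __syncwarp() instead of relying on lockstep execution.
// Assumes blockDim.x is a power of two and at least 64.
__global__ void ReductionSyncwarp_kernel( int *out, const int *in, size_t N )
{
    extern __shared__ int sPartials[];
    int sum = 0;
    const int tid = threadIdx.x;
    for ( size_t i = blockIdx.x*blockDim.x + tid;
          i < N;
          i += blockDim.x*gridDim.x ) {
        sum += in[i];
    }
    sPartials[tid] = sum;
    __syncthreads();

    for ( int activeThreads = blockDim.x>>1;
              activeThreads > 32;
              activeThreads >>= 1 ) {
        if ( tid < activeThreads ) {
            sPartials[tid] += sPartials[tid+activeThreads];
        }
        __syncthreads();
    }
    if ( tid < 32 ) {
        int v = sPartials[tid];
        for ( int offset = 32; offset > 0; offset >>= 1 ) {
            v += sPartials[tid + offset];  // every lane reads
            __syncwarp();                  // all reads done before any write
            sPartials[tid] = v;            // every lane writes
            __syncwarp();                  // all writes done before next read
        }
        if ( tid == 0 ) {
            out[blockIdx.x] = v;
        }
    }
}

// Hypothetical helper: reduces `data` with a single block and returns the sum.
int sumOneBlock( const std::vector<int> &data, int numThreads )
{
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc( &dIn, data.size()*sizeof(int) );
    cudaMalloc( &dOut, sizeof(int) );
    cudaMemcpy( dIn, data.data(), data.size()*sizeof(int),
                cudaMemcpyHostToDevice );
    ReductionSyncwarp_kernel<<<1, numThreads, numThreads*sizeof(int)>>>(
        dOut, dIn, data.size() );
    int result = 0;
    cudaMemcpy( &result, dOut, sizeof(int), cudaMemcpyDeviceToHost );
    cudaFree( dIn );
    cudaFree( dOut );
    return result;
}
```

Since __syncwarp() also orders the memory accesses of the participating threads, this version should not need the volatile pointer.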