CUDA: replacing `__syncthreads()` with `__threadfence()`, what is the difference?


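I copied the code below from the NVIDIA manual, where it is given as an example of `__threadfence()`. Why is `__threadfence()` used in this code? I think that using `__syncthreads()` instead of `__threadfence()` would give the same result.

Can someone explain the difference between the `__syncthreads()` and `__threadfence()` calls?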

__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,float* result)
{
    // Each block sums a subset of the input array
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum
        // to global memory
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure its result is visible to
        // all other threads
        __threadfence();

        // Thread 0 of each block signals that it is done
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 of each block determines if its block is
        // the last block to be done
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone
    __syncthreads();

    if (isLastBlockDone) 
    {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0)
        {
            // Thread 0 of last block stores total sum
            // to global memory and resets count so that
            // next kernel call works properly
            result[0] = totalSum;
            count = 0;
        }
    }
}

As far as shared memory is concerned, `__syncthreads()` is simply stronger than `__threadfence()`. With regard to global memory, they are two different things.

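- `__threadfence_block()` stalls the current thread until all of its writes to shared memory are visible to the other threads of the same block. It prevents the compiler from optimising by caching shared memory writes in registers. It does not synchronise the threads, and not all threads have to actually reach this instruction.
- `__threadfence()` stalls the current thread until all of its writes to shared and global memory are visible to all other threads.
- `__syncthreads()` must be reached by all threads of the block (that is, it must not sit inside a divergent `if` statement) and ensures that, for all threads in the block, the code preceding the instruction is executed before the code following it.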
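To make the `__syncthreads()` case concrete, here is a minimal sketch of a per-block sum in shared memory. The names `blockSum`, `BLOCK`, and `runBlockSum` are made up for illustration and do not come from the guide; the sketch assumes the kernel is launched with `blockDim.x == BLOCK` and that `BLOCK` is a power of two.

```cuda
#include <cuda_runtime.h>

constexpr unsigned int BLOCK = 256;  // assumed threads per block (power of two)

// Hypothetical kernel: each block sums up to BLOCK input elements.
__global__ void blockSum(const float* in, unsigned int n, float* out)
{
    __shared__ float tile[BLOCK];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Out-of-range threads contribute 0 instead of returning early,
    // so every thread of the block still reaches the barriers below.
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all writes to tile are done before anyone reads it

    for (unsigned int stride = BLOCK / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();  // barrier sits outside the divergent if
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = tile[0];
}

// Hypothetical host helper: sums n ones and adds the per-block results on the host.
float runBlockSum(unsigned int n)
{
    unsigned int blocks = (n + BLOCK - 1) / BLOCK;
    float *in, *out;
    cudaMallocManaged(&in, n * sizeof(float));
    cudaMallocManaged(&out, blocks * sizeof(float));
    for (unsigned int i = 0; i < n; ++i) in[i] = 1.0f;

    blockSum<<<blocks, BLOCK>>>(in, n, out);
    cudaDeviceSynchronize();

    float total = 0.0f;
    for (unsigned int b = 0; b < blocks; ++b) total += out[b];
    cudaFree(in);
    cudaFree(out);
    return total;
}
```

Here the blocks never need to see each other's data inside the kernel, so a block-level barrier is enough and no memory fence is required.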


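In your particular case, the `__threadfence()` instruction is used to make sure that the writes to the global array `result` are visible to everyone. `__syncthreads()` would only synchronise the threads of the current block, without enforcing that the global memory writes are visible to other blocks. What is more, at that point in the code you are inside an `if` branch and only one thread is executing that code; using `__syncthreads()` there would result in undefined behaviour of the GPU, most likely leading to complete desynchronisation of the kernel.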
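For reference, this is one way the kernel from the question could be filled in so that it actually runs. It is only a sketch: `blockReduce`, `BLOCK`, and `runSum` are hypothetical helpers standing in for `calculatePartialSum` and `calculateTotalSum`, which the question does not show. It also declares `result` as `volatile` so that the last block re-reads the partial sums from memory; that qualifier is an addition and is not in the code quoted in the question.

```cuda
#include <cuda_runtime.h>

constexpr unsigned int BLOCK = 128;  // assumed threads per block (power of two)

__device__ unsigned int count = 0;

// Hypothetical helper: sums one value per thread; every thread gets the block total.
__device__ float blockReduce(float v)
{
    __shared__ float tile[BLOCK];
    tile[threadIdx.x] = v;
    __syncthreads();
    for (unsigned int s = BLOCK / 2; s > 0; s /= 2) {
        if (threadIdx.x < s) tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    float total = tile[0];
    __syncthreads();  // tile may be reused by a later call
    return total;
}

__global__ void sum(const float* array, unsigned int N, volatile float* result)
{
    __shared__ bool isLastBlockDone;

    // Each block sums a subset of the input (grid-stride loop).
    float v = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
         i += gridDim.x * blockDim.x)
        v += array[i];
    float partialSum = blockReduce(v);

    if (threadIdx.x == 0) {
        result[blockIdx.x] = partialSum;
        // Fence: the partial sum must be visible device-wide
        // before this block announces that it is done.
        __threadfence();
        unsigned int value = atomicInc(&count, gridDim.x);
        isLastBlockDone = (value == gridDim.x - 1);
    }
    __syncthreads();  // whole block reads the same isLastBlockDone

    if (isLastBlockDone) {  // uniform across the block, so barriers inside are safe
        float t = 0.0f;
        for (unsigned int b = threadIdx.x; b < gridDim.x; b += blockDim.x)
            t += result[b];
        float totalSum = blockReduce(t);
        if (threadIdx.x == 0) {
            result[0] = totalSum;
            count = 0;  // reset so the next launch works
        }
    }
}

// Hypothetical host helper: sums n ones using the given number of blocks.
float runSum(unsigned int n, unsigned int blocks)
{
    float *array, *result;
    cudaMallocManaged(&array, n * sizeof(float));
    cudaMallocManaged(&result, blocks * sizeof(float));
    for (unsigned int i = 0; i < n; ++i) array[i] = 1.0f;
    sum<<<blocks, BLOCK>>>(array, n, result);
    cudaDeviceSynchronize();
    float total = result[0];
    cudaFree(array);
    cudaFree(result);
    return total;
}
```

The fence and the barrier do different jobs here: `__threadfence()` orders thread 0's write to `result` before its `atomicInc` as seen by other blocks, while `__syncthreads()` only makes sure the rest of the same block sees `isLastBlockDone`.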
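Check out the following chapters in the CUDA C Programming Guide:

- 3.2.2 "Shared Memory" (matrix multiplication example)
- 5.4.3 "Synchronization Instruction"
- B.2.5 "volatile"
- B.5 "Memory Fence Functions"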
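Using `__syncthreads()` inside an `if` branch that splits a warp leads to a deadlock, not desynchronisation.

Formally, it leads to "undefined behaviour". You may be right that a warp-divergent `if` can lead to a deadlock. However, even when the warps converge but the block diverges (`if`), one warp may stop at a different `__syncthreads` than another, because the barriers are indistinguishable (at least on some GPUs), and that leads to desynchronisation. Bottom line: bad things happen; don't do it.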
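A common way to stay out of this trouble is to turn the condition into a predicate and keep the barrier itself unconditional. The following is a rough sketch of that pattern; `adjacentDiff`, `TILE`, and `runAdjacentDiff` are hypothetical names, and the kernel assumes `blockDim.x == TILE`.

```cuda
#include <cuda_runtime.h>

constexpr unsigned int TILE = 8;  // assumed threads per block

// Hypothetical kernel: out[i] = in[i] - in[i-1] within each block
// (the first thread of a block just copies its element).
__global__ void adjacentDiff(const int* in, int* out, unsigned int n)
{
    __shared__ int tile[TILE];
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Predicate instead of "if (i >= n) return;", which would let some
    // threads skip the barrier below.
    bool active = (i < n);

    if (active) tile[threadIdx.x] = in[i];
    __syncthreads();  // reached by every thread, active or not

    if (active) {
        int left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0;
        out[i] = tile[threadIdx.x] - left;
    }
}

// Hypothetical host helper: runs the kernel on in[i] = i*i and returns out[idx].
int runAdjacentDiff(unsigned int n, unsigned int idx)
{
    unsigned int blocks = (n + TILE - 1) / TILE;
    int *in, *out;
    cudaMallocManaged(&in, n * sizeof(int));
    cudaMallocManaged(&out, n * sizeof(int));
    for (unsigned int i = 0; i < n; ++i) in[i] = static_cast<int>(i * i);

    adjacentDiff<<<blocks, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    int value = out[idx];
    cudaFree(in);
    cudaFree(out);
    return value;
}
```

With `n = 10` the second block has only two active threads, yet all eight of its threads still arrive at the same `__syncthreads()`.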