CUDA byte atomic operation: make only one thread act

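I'm writing a CUDA program that defines an array in shared memory. What I need is to allow only one thread to write to each index of this array, i.e. the first thread to reach the write instruction should change the value, while any other thread in the same warp or in a later warp should read the written value.

Here is the code snippet: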

char* seq_copied = seqs + (njobNew * halfLength); //this is the shared memory array
if (seq_copied[seq_1_index] == false) { //here is the condition that I need to check with only one thread
    seq_copied[seq_1_index] = true; //and this is the write that should be written by only one thread
    printf("copy seq_shared seq_1_index = %d,  block = %d \n", seq_1_index, blockIdx.x);
}
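What happens now is that all threads in the warp execute exactly the same instruction sequence, so the code inside the if condition runs 32 times. I need it to run only once.

How can I achieve this?

You can use atomicCAS() for this. It performs an atomic compare-and-swap operation.

This function tests a variable and, if it matches a given value (say, false), replaces it with another value (say, true). It does all of this atomically, i.e. without the possibility of interruption.

The return value of the atomic function gives us useful information here. If the return value in the example above is false, we can be certain that the variable has been replaced with true. We can also be certain that we were the first thread to hit this condition: every other thread performing the same operation gets a return value of true, not false.

Here is a worked example: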

$ cat t327.cu
#include <stdio.h>

__global__ void k(){

  __shared__ int flag;
  if (threadIdx.x == 0) flag = 0;
  __syncthreads();

  int retval = atomicCAS(&flag, 0, 1);
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
  // could do if statement on retval here
}


int main(){

  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 1
thread 3 saw flag as 1
thread 4 saw flag as 1
thread 5 saw flag as 1
thread 6 saw flag as 1
thread 7 saw flag as 1
thread 8 saw flag as 1
thread 9 saw flag as 1
thread 10 saw flag as 1
thread 11 saw flag as 1
thread 12 saw flag as 1
thread 13 saw flag as 1
thread 14 saw flag as 1
thread 15 saw flag as 1
thread 16 saw flag as 1
thread 17 saw flag as 1
thread 18 saw flag as 1
thread 19 saw flag as 1
thread 20 saw flag as 1
thread 21 saw flag as 1
thread 22 saw flag as 1
thread 23 saw flag as 1
thread 24 saw flag as 1
thread 25 saw flag as 1
thread 26 saw flag as 1
thread 27 saw flag as 1
thread 28 saw flag as 1
thread 29 saw flag as 1
thread 30 saw flag as 1
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
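The comment in the kernel above notes that an if statement could be placed on retval. A minimal sketch of that idea follows, assuming one block of 32 threads; the names first_writer_only, winners and count_winners are made up for illustration and do not come from the question. The guarded branch bumps a counter so the host can check how many threads entered it, which should come out as one per block.

```cuda
#include <stdio.h>

// Hypothetical kernel: counts how many threads take the guarded branch.
__global__ void first_writer_only(int *winners){

  __shared__ int flag;
  if (threadIdx.x == 0) flag = 0;
  __syncthreads();

  // atomicCAS returns the value that was in flag before this call.
  // Only the thread that observes 0 is the one that changed it to 1.
  if (atomicCAS(&flag, 0, 1) == 0) {
    atomicAdd(winners, 1);  // guarded work, entered by one thread per block
    printf("thread %d claimed the flag in block %d\n", threadIdx.x, blockIdx.x);
  }
}

// Hypothetical host helper: launches one block and returns the counter.
int count_winners(){
  int *d_winners, h_winners = -1;
  cudaMalloc(&d_winners, sizeof(int));
  cudaMemset(d_winners, 0, sizeof(int));
  first_writer_only<<<1,32>>>(d_winners);
  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(&h_winners, d_winners, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_winners);
  return h_winners;
}

int main(){
  printf("winners = %d\n", count_winners());
  return 0;
}
```

Which thread claims the flag is not specified, so code in the guarded branch should not rely on it being thread 0.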
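The array in the question is of type char, and atomicCAS has no char-sized overload. A byte-sized version can be built on top of the 32-bit atomicCAS:

$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomicCAS(char *addr, char cmp, char val){
  // atomicCAS works on 32-bit words, so operate on the aligned word that contains the byte
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  // bit offset of the target byte inside that word
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned mask = 0xFFU;
  mask <<= al_offset;
  mask = ~mask;
  // cast through unsigned char so a negative val is not sign-extended into the neighboring bytes
  unsigned sval = (unsigned char)val;
  sval <<= al_offset;
  unsigned old = *al_addr, assumed, setval;
  do {
        assumed = old;
        // leave memory untouched if the target byte does not match cmp
        if ((char)((assumed >> al_offset) & 0xFFU) != cmp) break;
        setval = assumed & mask;
        setval |= sval;
        old = atomicCAS(al_addr, assumed, setval);
    } while (assumed != old);
  // previous value of the target byte
  return (char) ((assumed >> al_offset) & 0xFFU);
}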

__global__ void k(){

  __shared__ char flag[1024];
  flag[threadIdx.x] = 0;
  __syncthreads();

  int retval = my_char_atomicCAS(flag+(threadIdx.x>>1), 0, 1);
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}


int main(){
  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$
The above presents a generalized atomicCAS for char-sized values. It lets you swap any char value for any other char value. In your particular case, if you effectively only need a boolean flag, you can make the operation more efficient with atomicOr, as mentioned in the comments. Using atomicOr eliminates the loop in the custom atomic function above. Here is a worked example:

$ cat t327.cu
#include <stdio.h>
__device__ char my_char_atomic_flag(char *addr){
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned my_bit = 1U << al_offset;
  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}

__global__ void k(){

  __shared__ char flag[1024];
  flag[threadIdx.x] = 0;
  __syncthreads();

  int retval = my_char_atomic_flag(flag+(threadIdx.x>>1));
  printf("thread %d saw flag as %d\n", threadIdx.x, retval);
}


int main(){
  k<<<1,32>>>();
  cudaDeviceSynchronize();
}
$ nvcc -o t327 t327.cu
$ cuda-memcheck ./t327
========= CUDA-MEMCHECK
thread 0 saw flag as 0
thread 1 saw flag as 1
thread 2 saw flag as 0
thread 3 saw flag as 1
thread 4 saw flag as 0
thread 5 saw flag as 1
thread 6 saw flag as 0
thread 7 saw flag as 1
thread 8 saw flag as 0
thread 9 saw flag as 1
thread 10 saw flag as 0
thread 11 saw flag as 1
thread 12 saw flag as 0
thread 13 saw flag as 1
thread 14 saw flag as 0
thread 15 saw flag as 1
thread 16 saw flag as 0
thread 17 saw flag as 1
thread 18 saw flag as 0
thread 19 saw flag as 1
thread 20 saw flag as 0
thread 21 saw flag as 1
thread 22 saw flag as 0
thread 23 saw flag as 1
thread 24 saw flag as 0
thread 25 saw flag as 1
thread 26 saw flag as 0
thread 27 saw flag as 1
thread 28 saw flag as 0
thread 29 saw flag as 1
thread 30 saw flag as 0
thread 31 saw flag as 1
========= ERROR SUMMARY: 0 errors
$

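These char atomic methods assume that you have allocated a char array whose size is a multiple of 4. For example, it would not be valid to do this with a char array of size 3 (and only 3 threads).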
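One way to respect the size constraint is to round the array length up to the next multiple of 4 and leave the extra bytes unused. The sketch below assumes 3 logical flags and one block of 32 threads; NFLAGS, NFLAGS_PADDED, claim and count_claimed_flags are illustrative names, not part of the question. The __align__(4) specifier is an added precaution, on the assumption that the array should also start on a 4-byte boundary so that every 32-bit word touched by atomicOr lies inside the array.

```cuda
#include <stdio.h>

// atomicOr-based byte flag: returns the previous value of the byte at addr.
__device__ char my_char_atomic_flag(char *addr){
  unsigned *al_addr = reinterpret_cast<unsigned *> (((unsigned long long)addr) & (0xFFFFFFFFFFFFFFFCULL));
  unsigned al_offset = ((unsigned)(((unsigned long long)addr) & 3)) * 8;
  unsigned my_bit = 1U << al_offset;
  return (char) ((atomicOr(al_addr, my_bit) >> al_offset) & 0xFFU);
}

#define NFLAGS 3                              // logical flag count (assumed)
#define NFLAGS_PADDED ((NFLAGS + 3) / 4 * 4)  // rounded up to a multiple of 4

// Hypothetical kernel: every thread tries to claim flag (threadIdx.x % NFLAGS).
__global__ void claim(int *winners){

  // Padded and 4-byte aligned, so word-sized atomics stay inside the array.
  __shared__ __align__(4) char flag[NFLAGS_PADDED];
  if (threadIdx.x < NFLAGS_PADDED) flag[threadIdx.x] = 0;
  __syncthreads();

  char *target = flag + (threadIdx.x % NFLAGS);
  // A return value of 0 means this thread was the first to set the flag.
  if (my_char_atomic_flag(target) == 0) atomicAdd(winners, 1);
}

// Hypothetical host helper: returns how many flags were claimed exactly once.
int count_claimed_flags(){
  int *d_winners, h_winners = -1;
  cudaMalloc(&d_winners, sizeof(int));
  cudaMemset(d_winners, 0, sizeof(int));
  claim<<<1,32>>>(d_winners);
  // cudaMemcpy waits for the kernel to finish before copying.
  cudaMemcpy(&h_winners, d_winners, sizeof(int), cudaMemcpyDeviceToHost);
  cudaFree(d_winners);
  return h_winners;
}

int main(){
  printf("claimed flags = %d\n", count_claimed_flags());
  return 0;
}
```

With 32 threads contending for 3 flags, the counter should end at 3, one winner per flag; the padding byte is never used as a flag.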
