How do I execute code sequentially in CUDA through Numba? (tags: cuda, gpu, numba)


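I want to execute CUDA threads in order. For example:

In the figure above, I want the values at index [thread_id, j] to be fed in order, i.e., array[1,2] is given a value only after array[0,0], array[0,1], array[0,2], and so on have been given values.

The approach I can think of is to set up a global array and keep polling the value of array[0,3]. Once array[0,3] has been given a value, I can feed array[1,2].

However, this fails. The code is as follows: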

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # loop through other elements in global_array
            for j in range(thread_id+1, N):

                # consistently read values from array
                for _ in range(1000): # or while True:

                    # for thread_id == 0, just execute
                    if thread_id==0:
                        cuda.atomic.add(array,(thread_id,j), 1)
                        break

                    # for thread_id>0
                    else: 

                        # if j reaches the last number of global_array
                        # just execute
                        if j == N-1:
                            cuda.atomic.add(array,(thread_id,j), 1)
                            break
                        else:  

                            # check if the previous thread_id, i.e., thread_id - 1
                            # finishes the execution of combination [thread_id-1,j+1]
                            if array[thread_id-1,j+1]>0:
                                cuda.atomic.add(array,(thread_id,j), 1)
                                break


N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)


output:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
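For comparison, the result the dependency rule should produce can be computed serially on the CPU. The following is a minimal plain-Python sketch, not GPU code; the function name `expected_result` is made up for illustration. It fills the grid row by row, which trivially satisfies the rule, and checks the rule before each write.

```python
def expected_result(n):
    """Fill an n x n grid serially, row by row, checking the dependency rule."""
    array = [[0.0] * n for _ in range(n)]
    for thread_id in range(n):
        for j in range(thread_id + 1, n):
            # Rule from the question: except in row 0 and in the last column,
            # cell [thread_id, j] may only be written once
            # cell [thread_id - 1, j + 1] has a value.
            if thread_id > 0 and j < n - 1:
                assert array[thread_id - 1][j + 1] > 0, "dependency not met"
            array[thread_id][j] += 1
    return array


# Print a small grid to see the pattern.
for row in expected_result(5):
    print(row)
```

Every cell strictly above the diagonal ends up as 1, whereas the GPU output above leaves most of rows 1 to 8 empty.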

Another approach I can think of is to use cuda.syncthreads(). Here is the code:

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # j starts from thread_id + 1
            j = thread_id + 1

            # loop through other elements in global_array
            for i in range(2*N-1):

                if i>2*thread_id:
                    if j<N:
                        cuda.atomic.add(array, (thread_id,j), 1)
                    j+=1
                    cuda.syncthreads()
                else:
                    cuda.syncthreads()

N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)

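This does work, of course. However, if the size of global_array is larger than the number of GPU cores, then for thread_id > number of GPU cores the execution passes through syncthreads() many times unnecessarily. That is a real waste of time.

Also, serialization between blocks is not possible.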
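To put a number on the wasted barrier calls, the loop of the syncthreads version can be replayed on the CPU. This is a plain-Python sketch with made-up helper names (`barrier_schedule`, `barrier_usage`); it only mirrors the index arithmetic of the kernel and runs no GPU code.

```python
def barrier_schedule(n):
    """Replay the kernel loop; return {iteration i: [(thread_id, j), ...]}."""
    writes = {}
    for thread_id in range(n):
        j = thread_id + 1
        for i in range(2 * n - 1):
            if i > 2 * thread_id:
                if j < n:
                    writes.setdefault(i, []).append((thread_id, j))
                j += 1
    return writes


def barrier_usage(n):
    """Return (barrier calls, writes) summed over the n active threads."""
    schedule = barrier_schedule(n)
    total_writes = sum(len(cells) for cells in schedule.values())
    # Each active thread reaches one barrier per loop iteration.
    total_barriers = n * (2 * n - 1)
    return total_barriers, total_writes


print(barrier_usage(10))
print(barrier_schedule(10)[3])
```

For N = 10 the replay counts 190 barrier calls for 45 writes. It also shows that thread t writes column i - t at iteration i, so [0, 3] and [1, 2] fall into the same iteration (i = 3), and the barrier alone does not order that particular pair.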

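I have three questions:

Why does the first code above fail, even though I use atomic operations?

Is there a better way to achieve this?

For the first approach, how can I serialize between blocks?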

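Why does the code above fail

Because the CUDA execution model gives no guarantee about the order in which threads run, and your assumptions about execution order may never hold. In addition, all the memory transactions in your code are non-atomic, so the pseudo-spinlock you are trying to implement cannot work either.

Do we have any cutter [sic] way to achieve this

No. There is no way to enforce execution order in Numba CUDA in the way you are asking for.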
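A commonly used alternative, which is not part of this answer, is to move the ordering to the host: kernels launched on the same CUDA stream run in launch order, so one launch per dependency level orders the levels without any in-kernel waiting. The sketch below is a plain-Python stand-in rather than GPU code. The names `run_step` and `run_all_steps` and the step number 2 * t + j are assumptions made for illustration; each call to `run_step` plays the role of one kernel launch, and the shuffle mimics the missing ordering between threads within a launch.

```python
import random


def run_step(array, step, n):
    """Stand-in for one kernel launch: handle every cell scheduled for `step`."""
    # Cell [t, j] is scheduled at step 2 * t + j, so j = step - 2 * t.
    cells = [(t, step - 2 * t) for t in range(n)
             if t + 1 <= step - 2 * t <= n - 1]
    random.shuffle(cells)  # threads inside one launch run in no particular order
    for t, j in cells:
        if t > 0 and j < n - 1:
            # The prerequisite [t - 1, j + 1] has step number step - 1,
            # so an earlier call has already written it.
            assert array[t - 1][j + 1] > 0, "dependency not met"
        array[t][j] += 1


def run_all_steps(n):
    """Host loop: one call (one 'launch') per step, in increasing step order."""
    array = [[0.0] * n for _ in range(n)]
    # Step numbers range from 1 (cell [0, 1]) to 3 * n - 5 (cell [n - 2, n - 1]).
    for step in range(1, 3 * n - 4):
        run_step(array, step, n)
    return array


for row in run_all_steps(5):
    print(row)
```

The cost of this design is one launch per step, so it pays off only when each step carries enough work to outweigh the launch overhead.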

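After some trial and error, it seems this cannot be done with the cuda.syncthreads() method.

However, it can be achieved by setting up a global array, where we must force an atomic operation before retrieving any value: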

import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array,array):
        thread_id = cuda.grid(1)
        if thread_id<N:

            # loop through other elements in global_array
            for j in range(thread_id+1, N):

                # consistently read values from array
                for _ in range(1000): # or while True:

                    # for thread_id == 0, just execute
                    if thread_id==0:
                        cuda.atomic.add(array,(thread_id,j), 1)
                        break
                    # for thread_id>0
                    else: 

                        # if j reaches the last number of global_array
                        # just execute
                        if j == N-1:
                            cuda.atomic.add(array,(thread_id,j), 1)
                            break
                        else:  

                            # check if the previous thread_id, i.e., thread_id - 1
                            # finishes the execution of combination [thread_id-1,j+1]
                            cuda.atomic.max(array, (thread_id-1,j+1), 0)
                            if array[thread_id-1,j+1]>0:
                                cuda.atomic.add(array,(thread_id,j), 1)
                                break


N = 10
global_array = np.arange(N)
array = np.zeros([N,N])

# Configure the blocks
threadsperblock = 64
# configure the grids
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array,array)
print(array)

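output:

[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]

Comments:

Of course there is a way to achieve this. Please look at the edited version. However, I used atomic operations as you suggested, and the first approach still does not work. Meanwhile, the second approach is somewhat slow, which is not what I want.

Your second approach uses syncthreads illegally. syncthreads may be used in conditional code only if the condition evaluates identically across the threadblock. Furthermore, syncthreads only imposes ordering within a threadblock, so your approach cannot be used to impose ordering between threads in different threadblocks.

@RobertCrovella, yes, you are right, serialization between blocks is not possible this way, but do you know how to do it? Even though it is illegal, it seems to sort of work.

@RobertCrovella I have written up an answer based on the global array idea. Could you take a look and comment?

For a non-trivial case with a large execution grid (so that there are more blocks than can occupy the GPU at the same time), this will still fail, for exactly the reason I outlined in my answer: you cannot guarantee that blocks will run in an order that does not result in a deadlock.

@talonmies Even if I set N=20000, the above code still works. In this particular problem, serialization between blocks does not matter. Serialization is guaranteed by forcing the use of the global array. By polling the value continuously, each thread gets a clear order. This still works even when we have more threads than actual CUDA cores. Of course, I agree with you that this is not safe for very large arrays. In any case, I did not use while True but a for loop, which ensures the threads do not hang.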
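The comment about cuda.syncthreads() above can be illustrated with a CPU analogy. In the sketch below, `threading.Barrier` stands in for cuda.syncthreads() within a single block, and the barrier call is hoisted out of every conditional so that all threads of the block reach it on every iteration, including those with thread_id >= N. The function name `barrier_block` and the schedule (thread t writes column i - 2 * t at iteration i) are assumptions for illustration. This is not Numba code, and it says nothing about ordering across blocks.

```python
import threading


def barrier_block(n, threads_per_block):
    """Emulate one thread block; return (array, list of rule violations)."""
    array = [[0.0] * n for _ in range(n)]
    violations = []
    barrier = threading.Barrier(threads_per_block)

    def worker(thread_id):
        for i in range(3 * n):
            j = i - 2 * thread_id  # column scheduled for this thread at iteration i
            if thread_id < n and thread_id + 1 <= j <= n - 1:
                # The prerequisite was written in iteration i - 1, before the barrier.
                if thread_id > 0 and j < n - 1 and array[thread_id - 1][j + 1] <= 0:
                    violations.append((thread_id, j))
                array[thread_id][j] += 1
            # Reached by every thread on every iteration, active or not.
            barrier.wait()

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(threads_per_block)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return array, violations


grid, broken = barrier_block(6, 8)
print(broken)
```

Because each prerequisite cell is scheduled exactly one iteration earlier than the cell that needs it, the barrier between iterations is what enforces the order in this sketch.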
