How can I make code execute sequentially in CUDA via numba?
I want CUDA threads to execute in a fixed order. For example, in the figure above (from the original post), I want the value at index [thread_id, j] to be written sequentially; that is, array[1, 2] should only receive a value after array[0, 0], array[0, 1], array[0, 2], and so on have been written. The approach I came up with is to set up a global array and repeatedly poll the value of array[0, 3]; once array[0, 3] has been written, I can write array[1, 2]. However, this fails. The code is as follows:
import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array, array):
    thread_id = cuda.grid(1)
    if thread_id < N:
        # loop through the remaining elements of global_array
        for j in range(thread_id + 1, N):
            # repeatedly poll values in array
            for _ in range(1000):  # or while True:
                # thread_id == 0 has no predecessor, so write immediately
                if thread_id == 0:
                    cuda.atomic.add(array, (thread_id, j), 1)
                    break
                # for thread_id > 0
                else:
                    # if j is the last index of global_array, write immediately
                    if j == N - 1:
                        cuda.atomic.add(array, (thread_id, j), 1)
                        break
                    else:
                        # check whether the previous thread, i.e. thread_id - 1,
                        # has finished writing combination [thread_id - 1, j + 1]
                        if array[thread_id - 1, j + 1] > 0:
                            cuda.atomic.add(array, (thread_id, j), 1)
                            break

N = 10
global_array = np.arange(N)
array = np.zeros([N, N])
# configure the blocks
threadsperblock = 64
# configure the grid
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array, array)
print(array)
Output:
[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]
Another approach I can think of is to use cuda.syncthreads(). Here is the code:
import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array, array):
    thread_id = cuda.grid(1)
    if thread_id < N:
        # j starts from thread_id + 1
        j = thread_id + 1
        # step through the anti-diagonal wavefronts of global_array
        for i in range(2 * N - 1):
            if i > 2 * thread_id:
                if j < N:
                    cuda.atomic.add(array, (thread_id, j), 1)
                    j += 1
                cuda.syncthreads()
            else:
                cuda.syncthreads()

N = 10
global_array = np.arange(N)
array = np.zeros([N, N])
# configure the blocks
threadsperblock = 64
# configure the grid
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array, array)
print(array)
This does work. However, if the size of global_array exceeds the number of GPU cores, threads with thread_id beyond the core count sit through many unnecessary rounds of syncthreads(), which is a real waste of time.

Also, serialization between blocks is impossible this way.
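For reference, the ordering being enforced here is an anti-diagonal wavefront: cell [t, j] is written in wave w = t + j. A minimal CPU-side NumPy sketch of that fill order (my own illustration, no GPU involved) reproduces the intended result:

```python
import numpy as np

N = 10
array = np.zeros((N, N))

# Wave w writes every cell (t, j) with t + j == w and j > t,
# mirroring the order the syncthreads kernel tries to enforce.
for w in range(1, 2 * N - 1):
    for t in range(N):
        j = w - t
        if t < j < N:
            array[t, j] += 1

print(array)
```

Each cell above the diagonal has a unique wave number w = t + j, so every such cell is written exactly once and the result is a strictly upper-triangular matrix of ones.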
I have three questions:

No, there is no way to force sequential execution in Numba CUDA the way you ask. After much trial and error, it seems the cuda.syncthreads() approach cannot be made to work. However, it can be achieved with a global array, provided we force an atomic operation before reading any value:
import math
import numpy as np
from numba import cuda

@cuda.jit
def keep_max(global_array, array):
    thread_id = cuda.grid(1)
    if thread_id < N:
        # loop through the remaining elements of global_array
        for j in range(thread_id + 1, N):
            # repeatedly poll values in array
            for _ in range(1000):  # or while True:
                # thread_id == 0 has no predecessor, so write immediately
                if thread_id == 0:
                    cuda.atomic.add(array, (thread_id, j), 1)
                    break
                # for thread_id > 0
                else:
                    # if j is the last index of global_array, write immediately
                    if j == N - 1:
                        cuda.atomic.add(array, (thread_id, j), 1)
                        break
                    else:
                        # force an atomic operation so the write from
                        # thread_id - 1 becomes visible before we read it
                        cuda.atomic.max(array, (thread_id - 1, j + 1), 0)
                        # check whether the previous thread, i.e. thread_id - 1,
                        # has finished writing combination [thread_id - 1, j + 1]
                        if array[thread_id - 1, j + 1] > 0:
                            cuda.atomic.add(array, (thread_id, j), 1)
                            break

N = 10
global_array = np.arange(N)
array = np.zeros([N, N])
# configure the blocks
threadsperblock = 64
# configure the grid
blockspergrid = (N + (threadsperblock - 1)) // threadsperblock

print(global_array)
keep_max[blockspergrid, threadsperblock](global_array, array)
print(array)
Of course there is a way to achieve this; see the edited version. However, I used atomic operations as you suggested and the first approach still does not work. Meanwhile, the second approach is somewhat slow, which is not what I want.

Your second approach uses syncthreads illegally. You may only use syncthreads in conditional code when the condition evaluates identically across the whole threadblock. Furthermore, syncthreads only imposes order within a threadblock, so your approach cannot impose order between threads in different threadblocks. @RobertCrovella

@RobertCrovella Yes, you are right, serialization between blocks is impossible, but do you know how to do it? Even though it is illegal, it does seem to mostly work.

@RobertCrovella I wrote up the global-array idea as an answer. Could you take a look and comment?

For a non-trivial case with a large execution grid (so more blocks than can simultaneously occupy the GPU), this will still fail, for exactly the reasons I outlined in my answer: you cannot guarantee the blocks will run in an order that does not lead to deadlock.

@talonmies Even when I set N = 20000, the code above still works. In this particular problem, serialization between blocks does not matter; serialization is guaranteed by forcing the global array. By repeatedly polling the value, every thread has a well-defined order, even when there are more threads than physical CUDA cores. Of course, I agree with you that this is unsafe for very large arrays. In any case, instead of while True I used a for loop, which guarantees the threads cannot hang forever.

Output:
[0 1 2 3 4 5 6 7 8 9]
[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 1. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 1. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 1. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 1. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 1. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 1. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]]