Inconsistent results in Python 2.7 CUDA GPU-accelerated code
Tags: python-2.7, cuda, numba, pycuda, numba-pro


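I am trying to compute the local binary pattern (LBP) of an image on my GPU, using the cuda module in Python (Numba). However, the results produced by running similar algorithms on the CPU and on the GPU are different. Can you help me solve this problem?

Below is the code snippet I am trying to execute: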

from __future__ import division
from skimage.io import imread, imshow
from numba import cuda
import time
import math
import numpy

# CUDA Kernel
@cuda.jit
def pointKernelLBP(imgGPU, histVec, pos) :
    ''' Computes Point Local Binary Pattern '''
    row, col = cuda.grid(2)
    if row+1 < imgGPU.shape[0] and col+1 < imgGPU.shape[1] and col-1>=0 and row-1>=0 :
        curPos = 0
        mask = 0
        for i in xrange(-1, 2) :
            for j in xrange(-1, 2) :
                if i==0 and j==0 :
                    continue
                if imgGPU[row+i][col+j] > imgGPU[row][col] :
                    mask |= (1<<curPos)     
                curPos+=1
        histVec[mask]+=1


#Host Code for computing LBP 
def pointLBP(x, y, img) :
    ''' Computes Local Binary Pattern around a point (x,y),
    considering 8 nearest neighbours '''
    pos = [0, 1, 2, 7, 3, 6, 5, 4]  
    curPos = 0
    mask = 0
    for i in xrange(-1, 2) :
        for j in xrange(-1, 2) :
            if i==0 and j==0 :
                continue
            if img[x+i][y+j] > img[x][y] :
                mask |= (1<<curPos)         
            curPos+=1
    return mask                 

def LBPHistogram(img, n, m) :
    ''' Computes LBP Histogram for given image '''
    HistVec = [0] * 256 
    for i in xrange(1, n-1) :
        for j in xrange(1, m-1) :
            HistVec[ pointLBP(i, j, img) ]+=1
    return HistVec

if __name__ == '__main__' :

    # Reading Image
    img = imread('cat.jpg', as_grey=True)
    n, m = img.shape

    start = time.time() 
    imgHist = LBPHistogram(img, n, m)
    print "Computation time incurred on CPU : %s seconds.\n" % (time.time() - start)    

    print "LBP Histogram Vector Using CPU :\n"
    print imgHist

    print type(img)

    pos = numpy.array([0, 1, 2, 7, 3, 6, 5, 4])

    img_global_mem = cuda.to_device(img)
    imgHist_global_mem = cuda.to_device(numpy.full(256, 0, numpy.uint8))
    pos_global_mem = cuda.to_device(pos)

    threadsperblock = (32, 32)
    blockspergrid_x = int(math.ceil(img.shape[0] / threadsperblock[0]))
    blockspergrid_y = int(math.ceil(img.shape[1] / threadsperblock[1]))
    blockspergrid = (blockspergrid_x, blockspergrid_y)

    start = time.time() 
    pointKernelLBP[blockspergrid, threadsperblock](img_global_mem, imgHist_global_mem, pos_global_mem)
    print "Computation time incurred on GPU : %s seconds.\n" % (time.time() - start)

    imgHist = imgHist_global_mem.copy_to_host()

    print "LBP Histogram as computed on GPU's : \n"
    print imgHist, len(imgHist)

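Now that you have fixed the obvious errors in the kernel code you originally posted, there are two problems preventing this code from working correctly.
The first, and most serious, is a memory race in the kernel. The update of the histogram bins: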
histVec[mask]+=1

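is not safe. Multiple threads in multiple blocks will try to read and write the same bin counters in global memory at the same time. CUDA makes no guarantee of correctness or repeatability in this situation.
The simplest solution (though not necessarily the best performing one, depending on your hardware) is to use atomic memory transactions. These do guarantee that the increment operations are serialized, but that serialization naturally carries some performance penalty. You can do this by changing the update code to something like this: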
cuda.atomic.add(histVec,mask,1)
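
The difference between the unsafe update and the atomic one can be reproduced on the CPU with NumPy. This is only a rough analogy, not a model of how the GPU actually schedules threads. In the sketch below, the fancy-indexed `+=` reads each bin once, adds one, and writes back, so repeated bin indices lose updates much as racing threads do. `numpy.add.at` applies every increment, which is the outcome `cuda.atomic.add` guarantees. The array names and sample values are illustrative assumptions.

```python
import numpy as np

# Bin index that each of eight hypothetical "threads" wants to increment.
masks = np.array([3, 3, 3, 7, 7, 200, 3, 7])

# Unsafe read-modify-write: all updates to a bin start from the same old
# value, so duplicate indices collapse into a single increment.
racy_hist = np.zeros(256, dtype=np.int64)
racy_hist[masks] += 1

# Serialized update: every increment is applied, one after another.
safe_hist = np.zeros(256, dtype=np.int64)
np.add.at(safe_hist, masks, 1)

print(racy_hist[[3, 7, 200]])  # one count per touched bin, updates were lost
print(safe_hist[[3, 7, 200]])  # the true totals for bins 3, 7 and 200
```

On a real GPU the number of lost updates varies from run to run, which is why the racy kernel gives results that are not even repeatable.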

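Note that CUDA only supports 32-bit and 64-bit atomic memory transactions, so you need to make sure that histVec has a compatible 32-bit or 64-bit integer type.
This leads to the second problem, which is that you have defined the bin counter vector as numpy.uint8. That means that, even without the memory race, you only have 8 bits to store each count, and the counts will quickly overflow for an image of any meaningful size. So, both for compatibility with atomic memory transactions and to prevent the counters from rolling over, you need to change the type of your counters.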
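
To see how quickly 8-bit counters roll over, here is a small CPU-only sketch. The helper `as_uint8_counters` and the 480 x 640 image size are assumptions made up for illustration. The sketch converts true counts to `numpy.uint8`, which keeps only the low 8 bits, and then allocates a 32-bit counter array of the kind the kernel would need.

```python
import numpy as np

def as_uint8_counters(true_counts):
    """Return what 8-bit counters would hold after that many increments."""
    return np.asarray(true_counts, dtype=np.int64).astype(np.uint8)

# A 480 x 640 image has (480 - 2) * (640 - 2) interior pixels, and each one
# adds a single count to one of the 256 bins.
interior_pixels = (480 - 2) * (640 - 2)
average_per_bin = interior_pixels / 256.0

true_counts = [10, 255, 256, 300, 70000]
print(as_uint8_counters(true_counts))  # each value is the true count modulo 256
print(average_per_bin)                 # far above the uint8 maximum of 255

# A 32-bit counter array is wide enough for images like this one.
hist = np.zeros(256, dtype=np.uint32)
```

With roughly a thousand counts per bin on average, almost every uint8 bin in such an image would wrap around at least once.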
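When I changed these things in the code you posted (and fixed the earlier problem with missing code), I could get exact agreement between the histograms computed by the GPU and by the host code for a random 8-bit input array.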
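
A check of this kind could be set up roughly as follows. This is a sketch, not the code I actually ran. `lbp_histogram_reference` is a hypothetical helper that is not part of the posted code. It builds the LBP codes with the same neighbour order and bit positions as `pointLBP`, but vectorized with NumPy so that larger test images stay fast. The image size and random seed are arbitrary.

```python
import numpy as np

def lbp_histogram_reference(img):
    """Vectorized CPU histogram using the same bit order as pointLBP."""
    img = np.asarray(img)
    n, m = img.shape
    center = img[1:n - 1, 1:m - 1]
    codes = np.zeros(center.shape, dtype=np.int64)
    bit = 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            if i == 0 and j == 0:
                continue  # the centre pixel does not get a bit
            neighbour = img[1 + i:n - 1 + i, 1 + j:m - 1 + j]
            codes |= (neighbour > center).astype(np.int64) << bit
            bit += 1
    return np.bincount(codes.ravel(), minlength=256)

# Random 8-bit test image with a fixed seed so runs are repeatable.
rng = np.random.RandomState(0)
img = rng.randint(0, 256, size=(64, 48)).astype(np.uint8)
expected = lbp_histogram_reference(img)

# On a machine with a CUDA device, the histogram copied back from the GPU
# (imgHist in the posted code) would then be compared like this:
#     assert np.array_equal(imgHist, expected)
print(expected.sum())  # one count per interior pixel
```

A random image is a convenient test input because it populates most of the 256 bins, so a wrong bit order or lost counts will show up in the comparison.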
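The underlying parallel histogram problem has been well described for CUDA, and there are plenty of examples and codebases you can study once you start to worry about performance.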
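Please format your code correctly.
I can see at least three problems in your code. The most obvious is that the actual histogram code is not the same in the kernel and in the host code. How can you expect them to produce the same output?
Sorry, I can't find any difference. Could you please point one out?
If you can't see the difference between the inner histogram code in the kernel and the otherwise identical code in pointLBP, then either you are not trying or you are beyond help. This is not a trivial bug-finding service, so please don't treat it as one.
I am sorry about that. I have now included the required edits, but the results are still inconsistent. Could you please help now?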