CUDA: __syncthreads() before shared memory operations?

Tags: memory, concurrency, cuda, shared

I'm in the rather bad situation of not being able to use a CUDA debugger. I'm getting some strange results from the use of __syncthreads in an application with a single shared array (deltas). The following part of the code is executed in a loop:

__syncthreads(); //if I comment this out, things get funny
deltas[lex_index_block] = intensity - mean;
__syncthreads(); //this line doesn't seem to matter, regardless of whether the first sync is commented out or not
//after sync: do something with the values of deltas written by this thread and other threads of this block
Basically, I have code with overlapping blocks (required by the nature of the algorithm). The program does compile and run, but somehow I get systematically wrong values in the vertically overlapping regions. This is very confusing to me, as I thought the correct way to synchronize is to sync after the threads have performed their writes to shared memory.

This is the whole function:

//XC without repetitions
template <int blocksize, int order>
__global__ void __xc(unsigned short* raw_input_data, int num_frames, int width, int height,
                 float * raw_sofi_data, int block_size, int order_deprecated){

//we make a distinction between real pixels and virtual pixels
//real pixels are pixels that exist in the original data

//overlap correction: every new block has a margin of 3 threads doing less work (only computing deltas)
int x_corrected = global_x() - blockIdx.x * 3;
int y_corrected = global_y() - blockIdx.y * 3;

//if the thread is responsible for any real pixel
if (x_corrected < width && y_corrected < height){

    __shared__ float deltas[blocksize];

    //the outer pixels of a block do not update SOFI values as they do not have sufficient information available
    //they are used only to compute mean and delta
    //also, pixels at the global edge have to be thrown away (as there is not sufficient data to interpolate)
    bool within_inner_block =
            threadIdx.x > 0
            && threadIdx.y > 0
            && threadIdx.x < blockDim.x - 2
            && threadIdx.y < blockDim.y - 2
            //global edge
            && x_corrected > 0
            && y_corrected > 0
            && x_corrected < width - 1
            && y_corrected < height - 1
            ;


    //init virtual pixels
    float virtual_pixels[order * order];
    if (within_inner_block){
        for (int i = 0; i < order * order; ++i) {
            virtual_pixels[i] = 0;
        }
    }


    float mean = 0;
    float intensity;
    int lex_index_block = threadIdx.x + threadIdx.y * blockDim.x;



    //main loop
    for (int frame_idx = 0; frame_idx < num_frames; ++frame_idx) {

        //shared memory read and computation of mean/delta
        intensity = raw_input_data[lex_index_3D(x_corrected,y_corrected, frame_idx, width, height)];

        __syncthreads(); //if I comment this out, things break
        deltas[lex_index_block] = intensity - mean;
        __syncthreads(); //this doesn't seem to matter

        mean = deltas[lex_index_block]/(float)(frame_idx+1);

        //if the thread is responsible for correlated pixels, i.e. not at the border of the original frame
        if (within_inner_block){
            //WORKING WITH DELTA STARTS HERE
            virtual_pixels[0] += deltas[lex_index_2D(
                        threadIdx.x,
                        threadIdx.y + 1,
                        blockDim.x)]
                    *
                    deltas[lex_index_2D(
                        threadIdx.x,
                        threadIdx.y - 1,
                        blockDim.x)];

            virtual_pixels[1] += deltas[lex_index_2D(
                        threadIdx.x,
                        threadIdx.y,
                        blockDim.x)]
                    *
                    deltas[lex_index_2D(
                        threadIdx.x + 1,
                        threadIdx.y,
                        blockDim.x)];

            virtual_pixels[2] += deltas[lex_index_2D(
                        threadIdx.x,
                        threadIdx.y,
                        blockDim.x)]
                    *
                    deltas[lex_index_2D(
                        threadIdx.x,
                        threadIdx.y + 1,
                        blockDim.x)];

            virtual_pixels[3] += deltas[lex_index_2D(
                        threadIdx.x,
                        threadIdx.y,
                        blockDim.x)]
                    *
                    deltas[lex_index_2D(
                        threadIdx.x+1,
                        threadIdx.y+1,
                        blockDim.x)];
            //                xc_update<order>(virtual_pixels, delta2, mean);
        }
    }

    if (within_inner_block){
        for (int virtual_idx = 0; virtual_idx < order*order; ++virtual_idx) {
            raw_sofi_data[lex_index_2D(x_corrected*order + virtual_idx % order,
                                       y_corrected*order + (int)floorf(virtual_idx / order),
                                       width*order)]=virtual_pixels[virtual_idx];
        }
    }
}
}

As far as I can see, the application can have a hazard between iterations of the loop: the write to deltas[lex_index_block] in iteration frame_idx+1 can map to the same location as the read of deltas[lex_index_2D(threadIdx.x, threadIdx.y - 1, blockDim.x)] in iteration frame_idx by a different thread. The two accesses are unordered, so the result is nondeterministic. Try running the application with cuda-memcheck --tool racecheck.
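
To make the role of each barrier concrete, here is a minimal annotated sketch of the loop body, reusing the names from the question's kernel. It only restates the pattern above with comments on which hazard each __syncthreads() orders; it is not a verified fix for the overlap problem:

for (int frame_idx = 0; frame_idx < num_frames; ++frame_idx) {

    intensity = raw_input_data[lex_index_3D(x_corrected, y_corrected,
                                            frame_idx, width, height)];

    //barrier 1: no thread may overwrite its deltas element until all
    //threads have finished reading neighbouring elements written in the
    //previous iteration (the cross-iteration hazard described above)
    __syncthreads();

    deltas[lex_index_block] = intensity - mean;

    //barrier 2: no thread may read neighbouring deltas elements until
    //all threads have written their own element for this iteration
    __syncthreads();

    //... reads of neighbouring deltas[...] as in the question's code ...
}

With the first barrier removed, a thread's write to deltas[lex_index_block] can overlap a neighbouring thread's read of the previous iteration's value, which matches the symptoms described in the question.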
