C++: OpenCL access violation when writing a buffer to the device


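I have a project in OpenCL: a matrix decomposition on the GPU. Everything works and the results are fine. The only problem I see is that when I run the program several times in a row (about once per second), I get an access violation while writing the initial buffers to the device.

It always gets stuck while writing the buffers. I am new to OpenCL, and I wonder whether I need to clear the GPU memory when my program exits. Sometimes it crashes on the first run but succeeds after 2 or 3 attempts. Other times it succeeds right away, and on the following runs too. It is completely random. The buffer write that actually fails also varies: sometimes it is the third write, sometimes the fourth.

The parameters I run this program with are a workgroup size of 7 and a matrix of 70*70 elements. At first I thought my matrix might be too large for the GPU (a GT650M with 2 GB), but runs with a matrix of 10,000 elements sometimes succeed as well.

The code up to the buffer writes is given below.

Any help is much appreciated.

PS: for clarity, PRECISION is a macro:

#define PRECISION float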

int main(int argc, char *argv[])
{
    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    //// INITIALIZATION PART ///////////////////////////////////////////////////////////////////////////////////////
    ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
    try {
        if (argc != 5) {
            std::ostringstream oss;
            oss << "Usage: " << argv[0] << " <kernel_file> <kernel_name> <workgroup_size> <array width>";
            throw std::runtime_error(oss.str());
        }
        // Read in arguments.
        std::string kernel_file(argv[1]);
        std::string kernel_name(argv[2]);
        unsigned int workgroup_size = atoi(argv[3]);
        unsigned int array_dimension = atoi(argv[4]);
        int total_matrix_length = array_dimension * array_dimension;

        int total_workgroups = total_matrix_length / workgroup_size;
        total_workgroups += total_matrix_length % workgroup_size == 0 ? 0 : 1;

        // Print parameters
        std::cout << "Workgroup size:  "   << workgroup_size      << std::endl;
        std::cout << "Total workgroups:  " << total_workgroups    << std::endl;
        std::cout << "Array dimension: "   << array_dimension     << " x " << array_dimension << std::endl;
        std::cout << "Total elements:  "   << total_matrix_length << std::endl;


        // OpenCL initialization
        std::vector<cl::Platform> platforms;
        std::vector<cl::Device> devices;
        cl::Platform::get(&platforms);
        platforms[0].getDevices(CL_DEVICE_TYPE_GPU, &devices);
        cl::Context context(devices);
        cl::CommandQueue queue(context, devices[0], CL_QUEUE_PROFILING_ENABLE);

        // Load the kernel source.
        std::string file_text;
        std::ifstream file_stream(kernel_file.c_str());
        if (!file_stream) {
            std::ostringstream oss;
            oss << "There is no file called " << kernel_file;
            throw std::runtime_error(oss.str());
        }
        file_text.assign(std::istreambuf_iterator<char>(file_stream), std::istreambuf_iterator<char>());

        // Compile the kernel source.
        std::string source_code = file_text;
        std::pair<const char *, size_t> source(source_code.c_str(), source_code.size());
        cl::Program::Sources sources;
        sources.push_back(source);
        cl::Program program(context, sources);
        try {
            program.build(devices);
        }
        catch (cl::Error& e) {
            getchar();
            std::string msg;
            program.getBuildInfo<std::string>(devices[0], CL_PROGRAM_BUILD_LOG, &msg);
            std::cerr << "Your kernel failed to compile" << std::endl;
            std::cerr << "-----------------------------" << std::endl;
            std::cerr << msg;
            throw(e);
        }
        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
        //// CREATE RANDOM INPUT DATA //////////////////////////////////////////////////////////////////////////////////
        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////

        // Create matrix to work on.
        // Create a random array.
        int matrix_width         = sqrt(total_matrix_length);
        PRECISION* random_matrix = new PRECISION[total_matrix_length];
        random_matrix            = randommatrix(total_matrix_length);
        PRECISION* A             = new PRECISION[total_matrix_length];

        for (int i = 0; i < total_matrix_length; i++)
            A[i] = random_matrix[i];

        PRECISION* L_SEQ = new PRECISION[total_matrix_length];
        PRECISION* U_SEQ = new PRECISION[total_matrix_length];
        PRECISION* P_SEQ = new PRECISION[total_matrix_length];

        // Do the sequential algorithm.
        decompose(A, L_SEQ, U_SEQ, P_SEQ, matrix_width);
        float* PA = multiply(P_SEQ, A, total_matrix_length);
        float* LU = multiply(L_SEQ, U_SEQ, total_matrix_length);
        std::cout << "PA = LU?" << std::endl;
        bool eq = equalMatrices(PA, LU, total_matrix_length);
        std::cout << eq << std::endl;
        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////
        //// RUN AND SETUP KERNELS /////////////////////////////////////////////////////////////////////////////////////
        ////////////////////////////////////////////////////////////////////////////////////////////////////////////////

        // Initialize arrays for GPU.
        PRECISION* L_PAR = new PRECISION[total_matrix_length];
        PRECISION* U_PAR = new PRECISION[total_matrix_length];
        PRECISION* P_PAR = new PRECISION[total_matrix_length];

        PRECISION* ROW_IDX = new PRECISION[matrix_width];
        PRECISION* ROW_VAL = new PRECISION[matrix_width];
        // Write A to U and initialize P.
        for (int i = 0; i < total_matrix_length; i++)
            U_PAR[i] = A[i];
        // Initialize P_PAR.
        for (int row = 0; row < matrix_width; row++)
        {
            for (int i = 0; i < matrix_width; i++)
                IDX(P_PAR, row, i) = 0;
            IDX(P_PAR, row, row) = 1;
        }
        // Allocate memory on the device
        cl::Buffer P_BUFF(context, CL_MEM_READ_WRITE, total_matrix_length*sizeof(PRECISION));
        cl::Buffer L_BUFF(context, CL_MEM_READ_WRITE, total_matrix_length*sizeof(PRECISION));
        cl::Buffer U_BUFF(context, CL_MEM_READ_WRITE, total_matrix_length*sizeof(PRECISION));
        // Buffer to determine maximum row value.
        cl::Buffer MAX_ROW_IDX_BUFF(context, CL_MEM_READ_WRITE, total_workgroups*sizeof(PRECISION));
        cl::Buffer MAX_ROW_VAL_BUFF(context, CL_MEM_READ_WRITE, total_workgroups*sizeof(PRECISION));

        // Create the actual kernels.
        cl::Kernel kernel(program, kernel_name.c_str());

        std::string max_row_kernel_name = "max_row";
        cl::Kernel max_row(program, max_row_kernel_name.c_str());
        std::string swap_row_kernel_name = "swap_row";
        cl::Kernel swap_row(program, swap_row_kernel_name.c_str());

        // transfer source data from the host to the device
        std::cout << "Writing buffers" << std::endl;
        queue.enqueueWriteBuffer(P_BUFF, CL_TRUE, 0, total_matrix_length*sizeof(PRECISION), P_PAR);
        queue.enqueueWriteBuffer(L_BUFF, CL_TRUE, 0, total_matrix_length*sizeof(PRECISION), L_PAR);
        queue.enqueueWriteBuffer(U_BUFF, CL_TRUE, 0, total_matrix_length*sizeof(PRECISION), U_PAR);

        queue.enqueueWriteBuffer(MAX_ROW_IDX_BUFF, CL_TRUE, 0, total_workgroups*sizeof(PRECISION), ROW_IDX);
        queue.enqueueWriteBuffer(MAX_ROW_VAL_BUFF, CL_TRUE, 0, total_workgroups*sizeof(PRECISION), ROW_VAL);

The function the debugger shows me is the following, in namespace cl:

cl_int enqueueWriteBuffer(
    const Buffer& buffer,
    cl_bool blocking,
    ::size_t offset,
    ::size_t size,
    const void* ptr,
    const VECTOR_CLASS<Event>* events = NULL,
    Event* event = NULL) const
{
    return detail::errHandler(
        ::clEnqueueWriteBuffer(
            object_, buffer(), blocking, offset, size,
            ptr,
            (events != NULL) ? (cl_uint) events->size() : 0,
            (events != NULL && events->size() > 0) ? (cl_event*) &events->front() : NULL,
            (cl_event*) event),
            __ENQUEUE_WRITE_BUFFER_ERR);
}


Edit: full source code.

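Just because the error occurs while enqueuing the buffer does not mean that is where the cause lies. You may have corrupted memory earlier, and the error only surfaces during the enqueue (much like memory corruption on the CPU, where a later call to free raises the error).

All CL functions return an error code; evaluate them by comparing against CL_SUCCESS. For example, if a kernel call did corrupt memory, enqueueReadBuffer will typically return CL_INVALID_COMMAND_QUEUE.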
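As a minimal sketch of the error-code check described above: the helper below stops at the first call that reports anything other than success. The names check, status_t, STATUS_SUCCESS and STATUS_INVALID_COMMAND_QUEUE are hypothetical stand-ins so that the snippet compiles without the OpenCL headers; in real code you would include <CL/cl.h> and use cl_int, CL_SUCCESS and CL_INVALID_COMMAND_QUEUE directly.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-ins (assumptions) so the sketch builds without OpenCL installed.
typedef int status_t;                               // plays the role of cl_int
const status_t STATUS_SUCCESS = 0;                  // numeric value of CL_SUCCESS
const status_t STATUS_INVALID_COMMAND_QUEUE = -36;  // numeric value of CL_INVALID_COMMAND_QUEUE

// Throw as soon as a call reports a failure, so the program stops at the
// first failing call instead of running on with a broken queue.
inline void check(status_t status, const char* what) {
    assert(what != nullptr);
    if (status != STATUS_SUCCESS) {
        throw std::runtime_error(std::string(what) + " failed with error code "
                                 + std::to_string(status));
    }
}
```

With the C++ wrapper from the question, a call site might look like check(queue.enqueueWriteBuffer(...), "write P_BUFF"). The question's code catches cl::Error, which suggests exceptions are enabled in the wrapper; in that case failures are thrown rather than returned, and the explicit check mostly matters for calls into the plain C API.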

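From your description of the problem I assume that you actually launch kernels repeatedly, but I do not see the corresponding code.

The most likely cause: a memory access in your kernel goes out of bounds and corrupts memory. Because you do not evaluate the error codes and simply continue executing the program, the driver will report an error (or just crash) sooner or later, but from that point on we are probably already dealing with undefined behavior, so what the driver says does not matter.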

Look at the following lines:

PRECISION* ROW_IDX = new PRECISION[matrix_width];
...
cl::Buffer MAX_ROW_IDX_BUFF(context, CL_MEM_READ_WRITE, total_workgroups*sizeof(PRECISION));
...
queue.enqueueWriteBuffer(MAX_ROW_IDX_BUFF, CL_TRUE, 0, total_workgroups*sizeof(PRECISION), ROW_IDX);

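So you are trying to write total_workgroups elements to the buffer, but the source array is only allocated with matrix_width elements. For the input parameters you mention (a 70x70 array with a workgroup size of 7), this attempts to read 700*4 bytes of data from a 70*4-byte array: a definite memory access violation.

Later in your code you read from the same buffer back into the same host array, which corrupts memory and caused various other crashes and unexplained behavior when I ran your code on my own system.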
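The size mismatch can be reproduced without a GPU. The sketch below redoes the arithmetic for the 70x70 case (4900 elements / 7 = 700 workgroups, versus a host array of 70 elements) and adds a guard that refuses a transfer larger than the host array. Both workgroup_count and checked_transfer_bytes are hypothetical helpers written for this illustration, not part of OpenCL, and std::vector is used here in place of the raw new[] arrays from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Number of workgroups needed to cover total_elements, rounded up.
// This mirrors the total_workgroups computation in the question.
inline std::size_t workgroup_count(std::size_t total_elements,
                                   std::size_t workgroup_size) {
    assert(workgroup_size > 0);
    return (total_elements + workgroup_size - 1) / workgroup_size;
}

// Hypothetical guard: returns the byte count for a transfer of `count`
// elements, and throws if the host array holds fewer elements than that.
template <typename T>
std::size_t checked_transfer_bytes(const std::vector<T>& host,
                                   std::size_t count) {
    if (host.size() < count) {
        throw std::length_error("host array smaller than requested transfer");
    }
    return count * sizeof(T);
}
```

For the question's parameters, workgroup_count(70 * 70, 7) is 700, so a 70-element ROW_IDX fails the guard. One plausible fix, assuming the kernels really do need one entry per workgroup, is to allocate ROW_IDX and ROW_VAL with total_workgroups elements so the host arrays match the device buffers.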

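The code I posted is everything that happens, so the error is raised before I execute any kernel.

OK, then evaluating the return values of the queue.enqueueWriteBuffer calls should help.

Thank you very much for your help. Even though this is an old post, it really saved my life!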