"Zero copy" is slower than non-zero-copy in my OpenCL/Cloo (C#) program

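This may just be a problem caused by the memory objects allocated by the .NET Framework not being properly page-aligned, but I don't understand why zero copy is slower than non-zero-copy for me.

I will include the code inline in this question, but the full source code can be seen here:

Since this is my first attempt at getting zero copy to work, I wrote a simple matrix multiplication example. I first initialize the OpenCL objects: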

    private void Initialize()
    {
        // get the intel integrated GPU
        _integratedIntelGPUPlatform = ComputePlatform.Platforms.Where(n => n.Name.Contains("Intel")).First();

        // create the compute context. 
        _context = new ComputeContext(
            ComputeDeviceTypes.Gpu, // use the gpu
            new ComputeContextPropertyList(_integratedIntelGPUPlatform), // use the intel openCL platform
            null,
            IntPtr.Zero);

        // the command queue is the, well, queue of commands sent to the "device" (GPU)
        _commandQueue = new ComputeCommandQueue(
            _context, // the compute context
            _context.Devices[0], // first device matching the context specifications
            ComputeCommandQueueFlags.None); // no special flags

        string kernelSource = null;
        using (StreamReader sr = new StreamReader("kernel.cl"))
        {
            kernelSource = sr.ReadToEnd();
        }

        // create the "program"
        _program = new ComputeProgram(_context, new string[] { kernelSource });

        // compile. 
        _program.Build(null, null, null, IntPtr.Zero);
        _kernel = _program.CreateKernel("ComputeMatrix");
    }
...which only executes once, if my code has not been initialized yet. Then I get into the main body. For non-zero-copy, I do the following:

    public float[] MultiplyMatrices(float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        if (!_initialized)
        {
            Initialize();
            _initialized = true;
        }

        ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix1);
        _kernel.SetMemoryArgument(0, matrix1Buffer);

        ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix2);
        _kernel.SetMemoryArgument(1, matrix2Buffer);

        float[] ret = new float[matrix1Height * matrix2Width];
        ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer,
            ret);
        _kernel.SetMemoryArgument(2, retBuffer);

        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);

        _commandQueue.Execute(_kernel,
            new long[] { 0 },
            new long[] { matrix2Width, matrix1Height },
            null, null);

        unsafe
        {
            fixed (float* retPtr = ret)
            {
                _commandQueue.Read(retBuffer,
                    false, 0,
                    ret.Length,
                    new IntPtr(retPtr),
                    null);

                _commandQueue.Finish();
            }
        }

        matrix1Buffer.Dispose();
        matrix2Buffer.Dispose();
        retBuffer.Dispose();

        return ret;
    }
You can see how I explicitly set CopyHostPointer for all of the ComputeBuffer allocations. This executes fine.
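One way to confirm that the GPU path "executes fine" is to compare its output against a plain CPU implementation. The sketch below is not from the original program: `CpuReference`, `Multiply`, and `AreClose` are hypothetical names, and it assumes both matrices are stored row-major in flat arrays, which the method signature and the `{ matrix2Width, matrix1Height }` global work size suggest but the post does not state.

```csharp
using System;

// Hypothetical helper, not part of the original program.
public static class CpuReference
{
    // Multiplies a (matrix1Height x inner) matrix by an (inner x matrix2Width) matrix.
    // ASSUMPTION: both inputs and the result are row-major flat arrays.
    public static float[] Multiply(float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        float[] ret = new float[matrix1Height * matrix2Width];

        for (int row = 0; row < matrix1Height; row++)
        {
            for (int col = 0; col < matrix2Width; col++)
            {
                float sum = 0f;
                for (int k = 0; k < matrix1WidthMatrix2Height; k++)
                {
                    sum += matrix1[row * matrix1WidthMatrix2Height + k]
                         * matrix2[k * matrix2Width + col];
                }
                ret[row * matrix2Width + col] = sum;
            }
        }

        return ret;
    }

    // Element-wise comparison with an absolute tolerance, since the GPU and CPU
    // may accumulate floating-point sums in a different order.
    public static bool AreClose(float[] expected, float[] actual, float tolerance)
    {
        if (expected.Length != actual.Length)
        {
            return false;
        }

        for (int i = 0; i < expected.Length; i++)
        {
            if (Math.Abs(expected[i] - actual[i]) > tolerance)
            {
                return false;
            }
        }

        return true;
    }
}
```

The result of `MultiplyMatrices(...)` could then be passed as `actual` to `CpuReference.AreClose(...)` together with the CPU result; a suitable tolerance depends on the matrix size and value range.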

Then I make the following adjustments (including setting "UseHostPointer" and calling Map/Unmap instead of Read):

    public float[] MultiplyMatricesZeroCopy(float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        if (!_initialized)
        {
            Initialize();
            _initialized = true;
        }

        ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix1);
        _kernel.SetMemoryArgument(0, matrix1Buffer);

        ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix2);
        _kernel.SetMemoryArgument(1, matrix2Buffer);

        float[] ret = new float[matrix1Height * matrix2Width];
        ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer,
            ret);
        _kernel.SetMemoryArgument(2, retBuffer);

        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);

        _commandQueue.Execute(_kernel,
            new long[] { 0 },
            new long[] { matrix2Width, matrix1Height },
            null, null);

        IntPtr retPtr = _commandQueue.Map(
            retBuffer,
            false,
            ComputeMemoryMappingFlags.Read,
            0,
            ret.Length, null);

        _commandQueue.Unmap(retBuffer, ref retPtr, null);
        _commandQueue.Finish();

        matrix1Buffer.Dispose();
        matrix2Buffer.Dispose();
        retBuffer.Dispose();

        return ret;
    }
However, the timings tell a different story. My program outputs the following:

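CPU matrix multiplication: 1178.5ms

GPU matrix multiplication (copy): 115.1ms

GPU matrix multiplication (zero copy): 174.1ms

GPU (with copy) is 10.23892 times faster

GPU (zero copy) is 6.769098 times faster

...so zero copy is slower.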
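
The reported speedups are consistent with dividing the CPU time by each GPU time (1178.5 / 115.1 ≈ 10.24 and 1178.5 / 174.1 ≈ 6.77), which also puts the zero-copy run at roughly 1.5 times the duration of the copy run. The measuring code is not shown in the post, so the following is only a sketch of how such numbers might be produced; `Benchmark`, `MeasureMilliseconds`, and `Speedup` are hypothetical names. It makes a warm-up call first because `Initialize()` compiles the kernel on the first invocation, which would otherwise be charged to whichever variant happens to be timed first.

```csharp
using System;
using System.Diagnostics;

// Hypothetical timing helper, not part of the original program.
public static class Benchmark
{
    // Runs the action warmupRuns times untimed, then times a single run.
    public static double MeasureMilliseconds(Action action, int warmupRuns = 1)
    {
        for (int i = 0; i < warmupRuns; i++)
        {
            action();
        }

        Stopwatch stopwatch = Stopwatch.StartNew();
        action();
        stopwatch.Stop();

        return stopwatch.Elapsed.TotalMilliseconds;
    }

    // How many times faster the candidate is than the baseline.
    public static double Speedup(double baselineMs, double candidateMs)
    {
        return baselineMs / candidateMs;
    }
}
```

A single timed run is noisy; averaging several runs per variant would give a steadier comparison.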

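Thanks to huseyin tugrul buyukisik, I figured out what was going on.

I needed to update my Intel drivers. Once I did that, zero copy was much faster.

For posterity, here is the final version of the zero-copy code: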

    public float[] MultiplyMatricesZeroCopy(float[] matrix1, float[] matrix2,
        int matrix1Height, int matrix1WidthMatrix2Height, int matrix2Width)
    {
        if (!_initialized)
        {
            Initialize();
            _initialized = true;
        }

        ComputeBuffer<float> matrix1Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix1);
        _kernel.SetMemoryArgument(0, matrix1Buffer);

        ComputeBuffer<float> matrix2Buffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
            matrix2);
        _kernel.SetMemoryArgument(1, matrix2Buffer);

        float[] ret = new float[matrix1Height * matrix2Width];
        GCHandle handle = GCHandle.Alloc(ret, GCHandleType.Pinned); 
        ComputeBuffer<float> retBuffer = new ComputeBuffer<float>(_context,
            ComputeMemoryFlags.UseHostPointer,
            ret);
        _kernel.SetMemoryArgument(2, retBuffer);

        _kernel.SetValueArgument<int>(3, matrix1WidthMatrix2Height);
        _kernel.SetValueArgument<int>(4, matrix2Width);

        _commandQueue.Execute(_kernel,
            new long[] { 0 },
            new long[] { matrix2Width, matrix1Height },
            null, null);

        IntPtr retPtr = _commandQueue.Map(
            retBuffer,
            true,
            ComputeMemoryMappingFlags.Read,
            0,
            ret.Length, null);

        _commandQueue.Unmap(retBuffer, ref retPtr, null);
        //_commandQueue.Finish();

        matrix1Buffer.Dispose();
        matrix2Buffer.Dispose();
        retBuffer.Dispose();
        handle.Free(); 

        return ret;
    }
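
The final version pins `ret` with a `GCHandle`, which keeps the garbage collector from moving the array while the device uses it. Pinning does not control alignment, though, which was the concern raised at the top of the question. As far as I recall Intel's guidance for its integrated GPUs, a `UseHostPointer` buffer can only avoid a copy when the host pointer is aligned to a 4096-byte page and the buffer size is a multiple of 64 bytes; treat both constants as assumptions to verify against the driver documentation. The helper below (`HostPointerAlignment` is a hypothetical name) is a minimal sketch that checks a managed array against those assumed requirements.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical helper, not part of the original program.
public static class HostPointerAlignment
{
    // ASSUMPTION: requirements taken from Intel's integrated-GPU guidance,
    // not from the original post. Verify for your driver and device.
    public const long PageSize = 4096;
    public const long CacheLineSize = 64;

    // True when the address is page-aligned and the size is a whole number
    // of cache lines.
    public static bool IsZeroCopyCandidate(long address, long sizeInBytes)
    {
        return address % PageSize == 0 && sizeInBytes % CacheLineSize == 0;
    }

    // Pins the array just long enough to read its address, then unpins it.
    // The address is only meaningful while the array stays pinned, so in real
    // use the check belongs next to the pin that guards the ComputeBuffer.
    public static bool IsZeroCopyCandidate(float[] data)
    {
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            long address = handle.AddrOfPinnedObject().ToInt64();
            return IsZeroCopyCandidate(address, (long)data.Length * sizeof(float));
        }
        finally
        {
            handle.Free();
        }
    }
}
```

If a managed array fails this check, one option is to allocate aligned native memory instead (for example `NativeMemory.AlignedAlloc` on .NET 6 or later) and create the buffer from that pointer. Whether the driver then actually skips the copy still has to be measured.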