C#: Using local workers in OpenCL for large matrix calculations


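I'm new to using OpenCL (with the OpenCL.NET library) in Visual Studio C#, and I'm currently working on an application that computes a large 3D matrix. At each pixel in the matrix, 192 unique values are computed and then summed to yield the final value for that pixel. So, functionally, it behaves like a 4-D matrix of size (161 x 161 x 161) x 192.

Right now, I'm calling the kernel from my host code like this: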

//C# host code
...
float[] BigMatrix = new float[161*161*161]; //1-D result array
CLCalc.Program.Variable dev_BigMatrix = new CLCalc.Program.Variable(BigMatrix);
CLCalc.Program.Variable dev_OtherArray = new CLCalc.Program.Variable(otherArray);
//...load some other variables here too.
CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[7] { /* stuff... */ };

//Here, I execute the kernel, with a 2-dimensional worker pool:
BigMatrixCalc.Execute(args, new int[2]{N*N*N,192});
dev_BigMatrix.ReadFromDeviceTo(BigMatrix);

And here is a sample of the kernel code:

__kernel void MyKernel(
__global float * BigMatrix,
__global float * otherArray,
//various other variables...
)
{
    int N = 161; //Size of matrix edges
    int pixel_id = get_global_id(0); //The location of the pixel in the 1D array
    int array_id = get_global_id(1); //The location within the otherArray


    //Finding the x,y,z values of the pixel_id.
    float3 p;
    p.x = pixel_id % N;    
    p.y = ((pixel_id % (N*N))-p.x)/N;
    p.z = (pixel_id - p.x - p.y*N)/(N*N);

    float result;

    //...
    //Some long calculation for 'result' involving otherArray and p...
    //...

    BigMatrix[pixel_id] += result;
}

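My code currently works, but I'm looking for more speed in this application, and I'm not sure whether my worker/group setup is the best approach (i.e. 161*161*161 and 192 as the dimensions of the worker pool).

I've seen other examples that organize the global worker pool into local work groups to increase efficiency, but I'm not quite sure how to implement that in OpenCL.NET. I'm also not sure how this differs from simply creating another dimension in the worker pool.

So, my question is: can I use local groups here, and if so, how would I organize them? In general, how does using local groups differ from just calling an n-dimensional worker pool (i.e. calling Execute(args, new int[]{(N*N*N), 192}) versus having a local work group size of 192)?

Thanks for your help.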

I have a few suggestions for you:

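1. I think your code has a race condition. The last line of your kernel lets the same element of BigMatrix be modified by multiple different work items: all 192 work items that share a pixel_id perform += on it.

2. If your matrix really is 161 x 161 x 161, there are plenty of work items here to use those dimensions as your only dimensions. You already have more than 4 million work items, which should be plenty of parallelism for your machine. You don't need 192 times that. Also, if you don't split the computation of an individual pixel across multiple work items, you don't need to synchronize the final addition.

3. If your global work size is not a nice multiple of a power of 2, you could try padding it so that it becomes one. Even if you pass NULL as the local work size, some OpenCL implementations choose inefficient local sizes for global sizes that don't divide well.

4. If your algorithm doesn't need local memory or barriers, you can pretty much skip local work groups.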
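The padding idea from the third suggestion can be sketched on the host in plain C. This is only an illustration: the helper name `round_up` and the work-group size of 64 are assumptions, not something taken from the question or from OpenCL.NET.

```c
#include <assert.h>
#include <stddef.h>

/* Round global_size up to the next multiple of local_size.
   round_up is a hypothetical helper name used for illustration. */
size_t round_up(size_t global_size, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = global_size % local_size;
    if (remainder == 0) {
        return global_size;                      /* already a multiple */
    }
    return global_size + (local_size - remainder);
}
```

For N = 161 the matrix has 161*161*161 = 4173281 elements, so with an assumed work-group size of 64 the padded global size would be 4173312. A kernel launched with the padded size then needs a guard such as `if (pixel_id >= N*N*N) return;` so the extra work items do not touch memory past the end of the buffer.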

Hope this helps!

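I think a lot of performance is being lost waiting on memory accesses. I have answered a similar question before, and I hope my post helps you. Please ask if you have any questions.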

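Optimizations:

The biggest boost in my version of the kernel comes from reading otherArray into local memory.

Each work item computes 4 values in BigMatrix. This means they can be written at the same time, on the same cache line. The loss of parallelism is minimal, because there are still more than 1M work items to execute.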
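One detail of the 4-values-per-work-item split is that 161*161*161 = 4173281 is not divisible by 4, so the last work item has fewer than 4 valid elements. The following plain C sketch shows the host-side arithmetic under that assumption; the helper names `work_items_needed` and `valid_count` are hypothetical.

```c
#include <assert.h>

/* Number of work items needed when each one handles per_item
   consecutive elements (rounds up so no element is left out). */
int work_items_needed(int total, int per_item)
{
    assert(per_item > 0);
    return (total + per_item - 1) / per_item;
}

/* How many of the per_item elements owned by work item gid
   actually lie inside the array of length total. */
int valid_count(int gid, int total, int per_item)
{
    int first = gid * per_item;      /* index of the first element */
    int remaining = total - first;   /* elements left from there   */
    if (remaining <= 0) {
        return 0;
    }
    return remaining < per_item ? remaining : per_item;
}
```

With these numbers, 1043321 work items are needed and the last one owns a single valid element, so a kernel using this layout would have to skip the out-of-range writes for that work item.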

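Comments:

Are the values in BigMatrix calculated from any other values in BigMatrix? How is 'p' used in the calculation? Can you give more information about the calculation you are trying to do?

Sure. The values of BigMatrix are not used in the calculation, only for indexing. The values of BigMatrix are initially 0 and are set by the calculation. The calculation uses the index of the current pixel in BigMatrix (p.x, p.y, p.z) to find a vector to another point specified by a value in otherArray. So each calculation is unique, because each pixel has a unique vector to each of the 192 points in otherArray. The magnitude and distance of this vector are then used in the calculation of the final value in BigMatrix.

Thanks for the reply. I like the idea of using atomic_add, but it seems to work only with int types. My calculation has to be floating point, so I need to be able to do synchronized additions involving floats. Is there an alternative to atomic_add that can add floats?

Ugh. Good point. No, floating point atomics are not supported in OpenCL. Given that, I would really consider just launching 161x161x161 work items.

#2: I agree. Spreading out the loop of 192 is overkill. #3: Alternatively, compute the largest suitable global work size and assign the remaining work to the CPU cores. #4: I disagree with this one. I will post my solution; it relies on local variables to speed things up considerably.

I don't doubt that using local variables can speed things up. My statement #4 says that if you are not using local variables or barriers, you shouldn't worry too much about work groups.

Right. I guess it isn't a requirement.

Thanks for your answer! But I do have some questions, because I can't seem to get your setup to work with my code: 1) Looking at your code, each worker creates a new cached 'otherValues' array, but I don't understand why the size of the cached array is still 192... aren't you only populating (192/local_size) elements? I assume the remaining elements are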
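On the question of why the cached array still has 192 elements: in OpenCL a `__local` array declared in a kernel is allocated once per work-group and shared by all of its work items, so each work item only has to fill its own strided slice. The plain C sketch below emulates that strided loop for one work-group on the host and counts how often each element is written. The names `count_loads` and `all_loaded_once` are hypothetical, and the group sizes tried are assumptions.

```c
#include <assert.h>

#define OTHER_SIZE 192

/* Emulate the strided copy loop as run by every work item of one
   work-group, recording how many times each element is written. */
void count_loads(int local_size, int loads[OTHER_SIZE])
{
    assert(local_size > 0);
    for (int i = 0; i < OTHER_SIZE; i++) {
        loads[i] = 0;
    }
    for (int local_id = 0; local_id < local_size; local_id++) {
        /* same loop shape as in the kernel: start at local_id, step by local_size */
        for (int i = local_id; i < OTHER_SIZE; i += local_size) {
            loads[i]++;
        }
    }
}

/* Returns 1 if every element was written exactly once by the group. */
int all_loaded_once(int local_size)
{
    int loads[OTHER_SIZE];
    count_loads(local_size, loads);
    for (int i = 0; i < OTHER_SIZE; i++) {
        if (loads[i] != 1) {
            return 0;
        }
    }
    return 1;
}
```

Whatever the group size, the group as a whole writes each of the 192 elements exactly once. On the device, the work items must then synchronize before reading the array, since each one depends on values written by the others.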

#define N 161
#define Nsqr (N*N)     //parenthesized so that x % Nsqr and x / Nsqr expand as intended
#define Ncub (N*N*N)
#define otherSize 192

__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
{
    //using 1 quarter of the total size of the matrix
    //this work item will be responsible for computing 4 consecutive values in BigMatrix
    //also reduces global size to (N^3)/4  ~= 1043000 for N=161

    int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
    int pixel_id;
    //array_id won't be used anymore. work items will process BigMatrix[pixel_id] entirely

    int local_id = get_local_id(0); //work item id within the group
    int local_size = get_local_size(0); //size of group


    float result[4]; //result cached for 4 global values
    int i, j;
    float3 p;

    //cache the values in otherArray to local memory
    //now each work item in the group will be able to read the values efficiently
    //each element in otherArray will be read a total of N^3 times, so this is important
    //opencl specifies at least 16kb of local memory, so up to 4k floats will work fine
    __local float otherValues[otherSize];
    for(i=local_id; i<otherSize; i+= local_size){
        otherValues[i] = otherArray[i];
    }
    //barrier (not mem_fence): every work item must wait until the whole group has finished loading
    barrier(CLK_LOCAL_MEM_FENCE);

    //now this work item can compute the complete result for pixel_id 
    for(j=0;j<4;j++){
        result[j] = 0;
        pixel_id = global_id + j;

        //Finding the x,y,z values of the pixel_id.
        //TODO: optimize the calculation of p.y and p.z
        //they will be the same most of the time for a given work item
        p.x = pixel_id % N;    
        p.y = ((pixel_id % Nsqr)-p.x)/N;
        p.z = (pixel_id - p.x - p.y*N)/Nsqr;

        for(i=0;i<otherSize;i++){
            //...
            //Some long calculation for 'result' involving otherValues[i] and p...
            //...
            //result[j] += ...
        }
    }
    //4 consecutive writes to BigMatrix will fall in the same cacheline (faster)
    BigMatrix[global_id] += result[0];
    BigMatrix[global_id + 1] += result[1];
    BigMatrix[global_id + 2] += result[2];
    BigMatrix[global_id + 3] += result[3];
}
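One pitfall worth checking in index arithmetic like this is macro expansion. A macro defined as a bare `N*N` is substituted textually, so `pixel_id % N*N` is parsed as `(pixel_id % N) * N`. The plain C sketch below compares both forms for the y coordinate; the names `NSQR_BAD`, `NSQR_OK`, `y_bad` and `y_ok` are hypothetical and exist only for this comparison.

```c
#include <assert.h>

#define N 161
#define NSQR_BAD N*N      /* textual expansion: x % N*N means (x % N) * N */
#define NSQR_OK  (N*N)    /* parentheses keep the product together        */

/* y coordinate computed with the unparenthesized macro */
int y_bad(int pixel_id)
{
    assert(pixel_id >= 0);
    return ((pixel_id % NSQR_BAD) - pixel_id % N) / N;
}

/* y coordinate computed with the parenthesized macro */
int y_ok(int pixel_id)
{
    assert(pixel_id >= 0);
    return ((pixel_id % NSQR_OK) - pixel_id % N) / N;
}
```

For pixel_id = 161, which is the first pixel of the second row (x = 0, y = 1), `y_ok` returns 1 while `y_bad` returns 0.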

__kernel void MyKernel(__global float * BigMatrix, __global float * otherArray)
{
    int global_id = get_global_id(0) * 4; //The location of the first pixel in the 1D array
    int pixel_id = global_id;

    int local_id = get_local_id(0); //work item id within the group
    int local_size = get_local_size(0); //size of group

    float result[4]; //result cached for 4 global values
    int i, j;
    float3 p;
    //Finding the initial x,y,z values of the pixel_id.
    p.x = pixel_id % N;
    p.y = ((pixel_id % Nsqr)-p.x)/N;
    p.z = (pixel_id - p.x - p.y*N)/Nsqr;

    //cache the values here. same as above...

    //now this work item can compute the complete result for pixel_id
    for(j=0;j<4;j++){
        result[j] = 0;

        for(i=0;i<otherSize;i++){
            //same i loop as above...
        }

        //advance x, y and z to the next pixel instead of computing them all from scratch
        //(done after the i loop so that the first pass uses the initial p)
        p.x += 1;
        if(p.x >= N){
            p.x = 0;
            p.y += 1;
            if(p.y >= N){
                p.y = 0;
                p.z += 1;
            }
        }
    }

    //write the 4 results to BigMatrix, same as above...
}
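The incremental update of x, y and z can be checked against the direct decomposition on the host. The plain C sketch below does that with integer coordinates; the type `Coord` and the helpers `from_index`, `step` and `increments_match` are hypothetical names introduced for the check, and the row-major layout (x fastest, then y, then z) is taken from the index formulas in the kernel.

```c
#include <assert.h>

#define N 161

typedef struct { int x, y, z; } Coord;

/* Direct decomposition of a 1D index into x, y, z (x varies fastest). */
Coord from_index(int pixel_id)
{
    assert(pixel_id >= 0);
    Coord c;
    c.x = pixel_id % N;
    c.y = (pixel_id / N) % N;
    c.z = pixel_id / (N * N);
    return c;
}

/* Advance to the next pixel, carrying from x into y and from y into z. */
Coord step(Coord c)
{
    c.x += 1;
    if (c.x >= N) {
        c.x = 0;
        c.y += 1;
        if (c.y >= N) {
            c.y = 0;
            c.z += 1;
        }
    }
    return c;
}

/* Returns 1 if stepping from start reproduces from_index for count pixels. */
int increments_match(int start, int count)
{
    Coord c = from_index(start);
    for (int j = 0; j < count; j++) {
        Coord d = from_index(start + j);
        if (c.x != d.x || c.y != d.y || c.z != d.z) {
            return 0;
        }
        c = step(c);   /* advance only after the current pixel was used */
    }
    return 1;
}
```

The check covers the carry cases as well: a run of 4 pixels starting at index 160 crosses a row boundary, and one starting at index 25920 crosses from z = 0 into z = 1.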