Android: atomic operations for floats in a compute shader

android opengl-es synchronization

I am using a compute shader to compute a sum (of type float), as follows:

#version 320 es
layout(local_size_x = 640, local_size_y = 480, local_size_z = 1) in;
layout(binding = 0) buffer OutputData {
    float sum[];
};
uniform sampler2D texture_1;
void main()
{
    vec2 texcoord = vec2(float(gl_LocalInvocationID.x) / 640.0, float(gl_LocalInvocationID.y) / 480.0);
    float val = textureLod(texture_1, texcoord, 0.0).r;
    // Synchronization is needed here: every invocation reads and writes sum[0]
    sum[0] = sum[0] + val;
    // Here I want to get the sum of all val in the first channel of texture_1
}

I know there are atomic operations such as atomicAdd(), but they do not support float parameters, and barrier() does not seem to solve my problem either.

Maybe I could encode the float as an int, or is there a simpler way to solve my problem?

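Atomics generally perform very poorly, especially when a large number of threads access the same location in parallel, so I would not recommend them for this use case.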

To keep things parallel here, you really need some kind of multi-pass reduction strategy. In pseudocode, something like this:

array_size = N
data = input_array

while array_size > 1:
   M = ceil(array_size / 2)
   spawn pass with M threads.
   thread i (0 <= i < M): out[i] = data[2*i] + data[2*i+1]   (treat an out-of-range element as 0)
   array_size = M
   data = out
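
As a rough illustration, one pass of the loop above might be written as a compute shader like the following. This is only a sketch under assumptions: the buffer names `data_in` and `data_out`, the uniform `array_size`, and the work group size of 64 are all hypothetical and not taken from the question.

```glsl
#version 320 es
// One 2:1 reduction pass: each invocation adds one pair of neighbouring elements.
// The work group size of 64 is an arbitrary choice for this sketch.
layout(local_size_x = 64) in;

// Hypothetical buffers: input of this pass and output of this pass.
layout(std430, binding = 0) readonly buffer InputData {
    float data_in[];
};
layout(std430, binding = 1) writeonly buffer OutputData {
    float data_out[];
};

// Hypothetical uniform: number of valid elements in data_in for this pass.
uniform uint array_size;

void main()
{
    uint i = gl_GlobalInvocationID.x;
    uint m = (array_size + 1u) / 2u;   // number of outputs, rounded up for odd sizes
    if (i >= m) {
        return;                        // surplus invocations in the last work group
    }

    uint a = 2u * i;
    uint b = a + 1u;

    float pair_sum = data_in[a];
    if (b < array_size) {              // the last pair may have only one element
        pair_sum += data_in[b];
    }
    data_out[i] = pair_sum;
}
```

On the host side, each pass would be issued with `glDispatchCompute`, with a `glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT)` between passes so that one pass's writes are visible to the next, and with the two buffers swapped after every pass. No invocation ever writes to a location that another invocation touches in the same pass, which is why no atomics are needed.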

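This is a simple 2:1 reduction, so it needs O(log2(N)) passes, but you can reduce more per pass to cut the memory bandwidth spent on intermediate storage. For a GPU that uses a texture as input, 4:1 works quite well (you can use textureGather, or even a simple linear filter, to load multiple samples in one texture operation).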
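
A first 4:1 pass that reads directly from the texture could look roughly like the sketch below. It assumes the texture has even dimensions (for example the 640x480 from the question), and the names `partial_sums` and `out_size` are hypothetical; later passes would then reduce the resulting buffer further.

```glsl
#version 320 es
// First 4:1 pass: each invocation sums one 2x2 block of texels with a single gather.
// The 8x8 work group size is an arbitrary choice for this sketch.
layout(local_size_x = 8, local_size_y = 8) in;

uniform highp sampler2D texture_1;

// Hypothetical output buffer holding one partial sum per 2x2 block.
layout(std430, binding = 0) writeonly buffer OutputData {
    float partial_sums[];
};

// Hypothetical uniform: texture size divided by 2 (e.g. 320x240 for a 640x480 texture).
uniform uvec2 out_size;

void main()
{
    uvec2 p = gl_GlobalInvocationID.xy;
    if (any(greaterThanEqual(p, out_size))) {
        return;                        // surplus invocations at the edges
    }

    vec2 tex_size = vec2(textureSize(texture_1, 0));

    // Sample at the corner shared by the four texels of the block, so the
    // gather footprint is exactly texels (2p.x, 2p.y) .. (2p.x+1, 2p.y+1).
    vec2 uv = (vec2(2u * p) + vec2(1.0)) / tex_size;

    // Fetch the red channel of all four texels in one texture operation.
    vec4 r = textureGather(texture_1, uv, 0);

    partial_sums[p.y * out_size.x + p.x] = r.x + r.y + r.z + r.w;
}
```

The linear-filter variant mentioned above would sample at the same coordinate with `textureLod` and multiply the filtered result by 4.0, at the cost of whatever precision the hardware filter provides.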