Vectorizing nested loops in MATLAB using bsxfun and the GPU


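For loops seem to be very slow, so I was wondering whether the nested loops in the code shown below could be vectorized using bsxfun, and perhaps whether the GPU could be brought in as well.

Code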

%// Parameters
i = 1;
j = 3;
n1 = 1500;
n2 = 1500;

%// Pre-allocate for output
LInc(n1+n2,n1+n2)=0;

%// Nested Loops - I 
for x = 1:n1
    for y = 1:n1
        num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
        LInc(x, y) = L1(x, y) + (num/denom);
        LInc(y, x) = LInc(x, y);
    end
end

%// Nested Loops - II
for x = 1:n1
    for y = 1:n2
        num = (n1 * n * L1(x,i)) + (n2 * n * L2(y,j)) - ((n1 * n2 * (L1(i, i) + L2(j, j) + 1)));
        LInc(x, n1+y) = num/denom;
        LInc(n1+y, x) = LInc(x, n1+y);
    end
end

Edit 1: n and denom can also be assumed to be constants.

Listed here are the vectorized CPU and GPU codes. I hope I have at least followed good practices in the GPU code and in the benchmarking that follows.

CPU code

GPU code

Benchmarking

GPU benchmarking tips were taken from [link not preserved].
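As a rough illustration of the bsxfun approach, here is a minimal sketch of how the two nested loops from the question could be vectorized on the CPU. It is not the answerer's listing. The small sizes and the values of n and denom are assumptions chosen for illustration, and the intermediate names (c, a, b, blockI, blockII) are hypothetical. Because the loops write LInc(y, x) = LInc(x, y) on every pass, the first block ends up holding the lower-triangle values of L1 mirrored across the diagonal, which the sketch reproduces with tril.

```matlab
% Illustrative inputs (assumed values, not from the original post)
i = 1; j = 3; n1 = 4; n2 = 5; n = 10; denom = 1e3;
L1 = rand(n1, n1);
L2 = rand(n2, j);

c = L1(i, i) + L2(j, j) + 1;   % scalar term shared by both loops
a = L1(:, i);                  % n1-by-1 column used by both loops
b = L2(:, j);                  % n2-by-1 column used by loop II

% Nested Loops - I: column plus row gives the n1-by-n1 grid of sums
num1   = (n2^2) * c - n2 * n * bsxfun(@plus, a, a.');
L1sym  = tril(L1) + tril(L1, -1).';   % lower triangle mirrored upward
blockI = L1sym + num1 / denom;

% Nested Loops - II: n1-by-n2 grid built from a and b
num2    = bsxfun(@plus, n1 * n * a, n2 * n * b.') - n1 * n2 * c;
blockII = num2 / denom;

% Assemble the (n1+n2)-by-(n1+n2) output; the last block stays zero
LInc = zeros(n1 + n2);
LInc(1:n1, 1:n1)     = blockI;
LInc(1:n1, n1+1:end) = blockII;
LInc(n1+1:end, 1:n1) = blockII.';
```

If L1 is already symmetric, L1sym equals L1 and the tril step can be dropped.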

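Results

Conclusions

The results show that the vectorized GPU code performs very well at larger data sizes: it goes from being slower than both the vectorized CPU code and the original code to being about twice as fast as the vectorized CPU code.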
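To make the GPU variant concrete, the following is a minimal sketch of how the same computation might be moved to the GPU with gpuArray and gather. It is an assumption-laden illustration, not the answerer's GPU listing. It needs the Parallel Computing Toolbox and a supported GPU, and it falls back to the CPU when neither is available. The inputs and intermediate names are hypothetical.

```matlab
% Illustrative inputs (assumed values, not from the original post)
i = 1; j = 3; n1 = 4; n2 = 5; n = 10; denom = 1e3;
L1 = rand(n1, n1);
L2 = rand(n2, j);

% Use the GPU only if one is available; otherwise stay on the CPU
try
    useGPU = gpuDeviceCount > 0;
catch
    useGPU = false;
end

c = L1(i, i) + L2(j, j) + 1;          % scalar term shared by both loops
a = L1(:, i);
b = L2(:, j);
L1sym = tril(L1) + tril(L1, -1).';    % lower triangle mirrored upward

if useGPU
    % Transfer only the arrays that take part in the element-wise work
    a = gpuArray(a);
    b = gpuArray(b);
    L1sym = gpuArray(L1sym);
end

blockI  = L1sym + ((n2^2) * c - n2 * n * bsxfun(@plus, a, a.')) / denom;
blockII = (bsxfun(@plus, n1 * n * a, n2 * n * b.') - n1 * n2 * c) / denom;

if useGPU
    % Bring the results back to host memory
    blockI  = gather(blockI);
    blockII = gather(blockII);
end

LInc = [blockI, blockII; blockII.', zeros(n2)];
```

At sizes this small the host-to-device transfers would likely dominate, which fits the observation above that the GPU version only pays off at larger data sizes.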

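If you have not done so already, you should preallocate LInc:

LInc = zeros(n1,n2);

If you want to vectorize it, you do not need bsxfun to do so. I think you could do something like this: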

x = 1:n1;
y = 1:n1;
num = ((n2 ^ 2) * (L1(i, i) + L2(j, j) + 1)) - (n2 * n * (L1(x,i) + L1(y,i)));
LInc(x, y) = L1(x, y) + (num/denom);

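However, this code confuses me, because you are in fact overwriting the values of LInc multiple times. Without knowing what your goal is, it is hard for me to help any further. The code above may not return the same values as your function.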
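One likely reason the snippet above would not match the loops is a shape issue: L1(x,i) + L1(y,i) adds two n1-by-1 columns and yields another n1-by-1 column, not the n1-by-n1 grid of pairwise sums the loops compute. The sketch below shows the difference on a small example. The input matrix and the names colSum and gridSum are assumptions for illustration.

```matlab
% Small deterministic example (assumed values)
n1 = 4; i = 1;
L1 = magic(n1);
x = 1:n1;
y = 1:n1;

% Column plus column: element-wise sum, still n1-by-1
colSum = L1(x, i) + L1(y, i);

% Column plus row: every pairwise sum, n1-by-n1
gridSum = bsxfun(@plus, L1(x, i), L1(y, i).');

disp(size(colSum));
disp(size(gridSum));
```

Here gridSum(p, q) equals L1(p, i) + L1(q, i), which is the term inside the inner loop.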

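You have two different sets of indices, x, y and i, j, but i, j are not defined here. Please include more code describing i, j, or rewrite the code so that it produces reproducible results.

Sorry, I forgot to mention that i=1 and j=3; n1 and n2 are also defined constants...

You should preallocate LInc. What is denom? I assume it is another constant?

Yes, I have preallocated LInc with order n1+n2, and denom is another constant.

You will probably need meshgrid or ndgrid or repmat to create a 2D num.

Good point, L1(x,i) and L1(y,i) probably need to be made 2D, and one of them needs to be transposed! Basically, the way the nested loops are set up, the lower triangular part gets written over the upper triangular part.

Thanks, everyone, these comments helped me understand how bsxfun works. @Diwakar: your code helped me understand how bsxfun works. Thanks a lot, everyone. I implemented it the same way you all suggested, and my application is now almost 250 times faster...

+1 for answering a question that was so poorly formatted and making it cool. Great job on the timing plots!

@rayryeng I saw the potential in it, and the name bsxfun was enough to pull me in :) Thanks for the support!!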
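The ndgrid suggestion and the remark about the triangular parts can both be checked on a small example. The sketch below builds the 2D num for the first loop with ndgrid, then compares the loop's output with a lower-triangle-mirrored version of L1. All input values and the names loopI and mirrored are assumptions for illustration, and L1 is deliberately non-symmetric so the overwrite is visible.

```matlab
% Small deterministic example (assumed values)
i = 1; j = 3; n1 = 4; n2 = 5; n = 10; denom = 1e3;
L1 = magic(n1);          % non-symmetric on purpose
L2 = magic(n2);
c  = L1(i, i) + L2(j, j) + 1;

% Build index grids; indexing a vector with a matrix keeps the matrix shape
[X, Y] = ndgrid(1:n1, 1:n1);
a   = L1(:, i);
num = (n2^2) * c - n2 * n * (a(X) + a(Y));   % n1-by-n1

% First nested loop from the question, using the precomputed num
loopI = zeros(n1);
for x = 1:n1
    for y = 1:n1
        loopI(x, y) = L1(x, y) + num(x, y) / denom;
        loopI(y, x) = loopI(x, y);
    end
end

% Lower triangle of L1 mirrored into the upper triangle
mirrored = tril(L1) + tril(L1, -1).' + num / denom;
disp(max(abs(loopI(:) - mirrored(:))));
```

The displayed difference should be zero (or at rounding level), whereas L1 + num/denom without the mirroring differs from the loop output whenever L1 is not symmetric.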

%// Warm up GPU call with insignificant small scalar inputs, just in case
%// gputimeit doesn't do the same
temp1 = modp2(1,1,1,1,1,1,1,1); %// This is vectorized GPU code

i = 1;
j = 3;
n = 1000; %// Assumed
denom = 1e6;  %// Assumed

N_arr = [50 100 200 500 1000 1500]; %// array elements for N (datasize)
timeall = zeros(3,numel(N_arr));

for k1 = 1:numel(N_arr)
    N = N_arr(k1);
    n1 = N;  %// n1, n2 are assumed identical for less-complicated benchmarking
    n2 = N;

    L1 = rand(n1,n1);
    L2 = rand(n2,j);

    f = @() modp0(i,j,n1,n2,L1,L2,n,denom);%// Original CPU w/ preallocation
    timeall(1,k1) = timeit(f);
    clear f

    f = @() modp1(i,j,n1,n2,L1,L2,n,denom);%// Vectorized CPU code
    timeall(2,k1) = timeit(f);
    clear f

    f = @() modp2(i,j,n1,n2,L1,L2,n,denom);%// Vectorized GPU(GTX 750Ti) code
    timeall(3,k1) = gputimeit(f);
    clear f
end

%// Display benchmark results
figure,hold on, grid on
plot(N_arr,timeall(1,:),'-b.')
plot(N_arr,timeall(2,:),'-ro')
plot(N_arr,timeall(3,:),'-kx')
legend('Original CPU','Vectorized CPU','Vectorized GPU (GTX 750 Ti)')
xlabel('Datasize (N) ->'),ylabel('Time(sec) ->')
